Abstract

Set cover, over a universe of size $n$, may be modelled as a data-streaming problem, where
the $m$ sets that comprise the instance are to be read one by one. A semi-streaming algorithm
is allowed only $O(n\,\mathrm{poly}\{\log n, \log m\})$ space to process this stream. For each $p \ge 1$, we give
a very simple deterministic algorithm that makes $p$ passes over the input stream and returns
an appropriately certified $(p+1)n^{1/(p+1)}$-approximation to the optimum set cover. More
importantly, we proceed to show that this approximation factor is essentially tight, by showing
that a factor better than $0.99\,n^{1/(p+1)}/(p+1)^2$ is unachievable for a $p$-pass semi-streaming
algorithm, even allowing randomisation. In particular, this implies that achieving a
$\Theta(\log n)$-approximation requires $\Omega(\log n/\log\log n)$ passes, which is tight up to the $\log\log n$ factor.

These results extend to a relaxation of the set cover problem where we are allowed to leave
an $\varepsilon$ fraction of the universe uncovered: the tight bounds on the best approximation factor
achievable in $p$ passes turn out to be $\Theta_p(\min\{n^{1/(p+1)}, \varepsilon^{-1/p}\})$.

Our lower bounds are based on a construction of a family of high-rank incidence geometries,
which may be thought of as vast generalisations of affine planes. This construction, based
on algebraic techniques, appears flexible enough to find other applications and is therefore
interesting in its own right.

1 Introduction


The set cover problem is one of the most basic and well-studied optimisation problems in computer
science. It features either directly or in various guises in a wide array of applications, such
as facility location, information retrieval [2], software test selection, and tableau generation [15].
It is also at the heart of a rich theory spanning approximation algorithms [30] and computational
complexity theory [3], where efforts to understand the complexity of set cover have led to interesting
combinatorial and mathematical interactions. In this work, we consider set cover as a "big
data" problem; specifically, we are concerned with space-efficient algorithms for set cover in the 
well-established data streaming model [26, 4]. This setting has been studied in several recent works, 
including Saha & Getoor [28], Emek & Rosen [12], and Demaine et al. [10]. 

An instance of set cover is given by a pair $(X, \mathcal{F})$, where $X$ is a finite universe with cardinality
$|X| = n$ and $\mathcal{F} \subseteq 2^X$ is a finite collection (multiset) of subsets of $X$ with cardinality $|\mathcal{F}| = m$.
The pair $(X, \mathcal{F})$ satisfies the guarantee that the sets in $\mathcal{F}$ together cover $X$, i.e., $\bigcup_{S \in \mathcal{F}} S = X$. A
candidate solution to the instance is a subcollection $Sol \subseteq \mathcal{F}$; it is said to be feasible if $\bigcup_{S \in Sol} S = X$.
Its cost is defined to be the cardinality $|Sol|$.^1 The desired goal is to find a feasible solution while
keeping its cost small. A feasible solution with minimum possible cost is said to be optimal and
its cost is called the optimum cost or optimum value of the instance. Henceforth we shall call this
problem SET-COVER$_{n,m}$, or simply SET-COVER.
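
As a concrete reading of these definitions, the following minimal Python sketch checks feasibility and computes the cost of a candidate solution. The representation (a dictionary from set IDs to sets of elements) and the names `is_feasible` and `cost` are assumptions made for illustration only; they are not notation from this paper.

```python
from typing import Dict, Iterable, Set


def is_feasible(universe: Set[int], family: Dict[int, Set[int]], sol: Iterable[int]) -> bool:
    """Return True iff the sets whose IDs are listed in `sol` together cover `universe`."""
    covered: Set[int] = set()
    for set_id in sol:
        covered |= family[set_id]
    return covered >= universe


def cost(sol: Iterable[int]) -> int:
    """The cost of a solution is the number of distinct sets it uses."""
    return len(set(sol))


# A toy instance with n = 5 and m = 4 (hypothetical data).
universe = {1, 2, 3, 4, 5}
family = {1: {1, 2, 3}, 2: {3, 4}, 3: {4, 5}, 4: {1, 5}}
```

On this toy instance, `[1, 3]` is a feasible solution of cost 2, whereas `[1, 2]` leaves element 5 uncovered.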

It is well-known that finding an optimal solution to SET-COVER is NP-hard [18]; that finding
an $\alpha$-approximate solution (defined as a feasible solution whose cost is at most $\alpha$ times that of
the optimum) is possible in polynomial time for $\alpha = \ln n - \ln\ln n + \Theta(1)$ [29]; and that doing so
for $\alpha < (1 - \varepsilon)\ln n$ is impossible unless NP = P [11]. Thus, for traditional Turing Machine
computation, the complexity of SET-COVER is essentially fully understood. However, for genuinely
huge instances of SET-COVER, additional considerations become important: how will the data be 
accessed and how will it be manipulated in a relatively small amount of working memory? 

This motivates a careful study of the complexity of SET-COVER in a data-streaming setting. The
instance $(X, \mathcal{F})$ is presented as a stream consisting of the sets in $\mathcal{F}$, one at a time; the universe $X$
is known in advance, so we may assume that $X = [n] := \{1, 2, \ldots, n\}$. Representing an instance
of SET-COVER$_{n,m}$ requires $\Theta(mn)$ bits in general. Thus, in $\Theta(mn)$ bits of space (working memory),
we could simply run our favourite offline algorithm. The challenge is to work with sublinear,
i.e., $o(mn)$, space. A $p$-pass algorithm may read its input stream up to $p$ times; this parameter
$p$, sometimes called the pass complexity, ought to be a small constant, or perhaps $O(\log n)$. Of
course, in addition to space and pass efficiency, we would also want our algorithms to process 
each set quickly, with very simple operations and logic. 

Since $\Omega(n)$ space is required simply to certify that a computed solution is feasible, we shall
think of an algorithm as highly space-efficient if it uses $\tilde{O}(n) := O(n\,\mathrm{poly}\{\log n, \log m\})$ space.
Following a convention started with the study of streaming graph algorithms [14], and continued
in this context by Emek & Rosen [12], we shall call such an algorithm a semi-streaming algorithm.
Emek & Rosen undertook a detailed study of one-pass semi-streaming algorithms for SET-COVER,
obtaining nearly tight bounds on the best approximation ratio achievable by such algorithms. In
this work, we provide tight bounds for the multi-pass case, giving an almost complete understanding
of the pass/approximation tradeoff for semi-streaming algorithms. In particular, this
answers an open question explicitly raised by Saha & Getoor [28]. 

^1 In weighted set cover, each set $S \in \mathcal{F}$ has a cost or weight $w(S) \ge 0$ and the cost of $Sol$ is $\sum_{S \in Sol} w(S)$. The major
contributions of this work being lower bounds, we focus on the purely combinatorial setting, which is of course a
strength for lower bounds.

1.1 Our Results and Techniques

A classic result of Johnson [17], refined by Slavik [29], gives a $(\ln n - \ln\ln n + \Theta(1))$-approximation
to SET-COVER by a greedy algorithm. Given an instance $(X, \mathcal{F})$, at each step, it adds to
the current solution the set from $\mathcal{F}$ that contributes most, i.e., covers the largest number of
as-yet-uncovered elements. Notice that this can be implemented as a semi-streaming algorithm, using
one pass for each step, but this leads to $\Omega(n)$ passes, which is ridiculously expensive. Saha &
Getoor [28] gave a different algorithm, which guarantees an $O(\log n)$-approximation using only
$O(\log n)$ passes. Emek & Rosen [12] asked how good an approximation is possible for a one-pass
semi-streaming algorithm. They showed that an approximation ratio of $O(\sqrt{n})$ is achievable and
that the ratio must be $\Omega(n^{1/2-\delta})$ for every constant $\delta > 0$; Section 1.2 adds some detail. Our first
result generalises their upper bound, trading off additional passes for improved approximation.
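
For reference, the classic greedy rule described above can be sketched in a few lines of Python. This is an offline illustration under assumed conventions (elements are 1..n, the family is a dictionary from set IDs to sets, and the function name `greedy_set_cover` is ours); each iteration of the loop corresponds to one full scan of the family, which is why a streaming implementation would spend one pass per chosen set.

```python
from typing import Dict, List, Set


def greedy_set_cover(n: int, family: Dict[int, Set[int]]) -> List[int]:
    """Repeatedly pick the set covering the most as-yet-uncovered elements of {1..n}."""
    uncovered = set(range(1, n + 1))
    sol: List[int] = []
    while uncovered:
        # One scan over the whole family: this is the step that costs a pass.
        best = max(family, key=lambda i: len(family[i] & uncovered))
        if not family[best] & uncovered:
            raise ValueError("the family does not cover the universe")
        sol.append(best)
        uncovered -= family[best]
    return sol


# Hypothetical toy instance: greedy picks set 1 (4 new elements), then set 4 (2 new elements).
family = {1: {1, 2, 3, 4}, 2: {1, 2, 5}, 3: {3, 4, 6}, 4: {5, 6}}
solution = greedy_set_cover(6, family)
```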

Result 1 (Formalised as Theorem 2.5). In $p$ passes, within semi-streaming space bounds, we can compute
a $(p+1)n^{1/(p+1)}$-approximate solution to SET-COVER together with an appropriate "certificate of
coverage."

The algorithm behind Result 1 is a variant of the greedy approach wherein each pass picks sets 
that contribute above some well-chosen threshold for that pass, and the sequence of thresholds is 
geometrically decreasing. This kind of thresholding is itself a variant of ideas introduced by Cormode,
Karloff, and Wirth [9] in a non-streaming context. Our algorithm needs one final "folding"
trick that considers the final two thresholds in the sequence in a single pass. 

The Emek-Rosen algorithm solves a more general problem, with set weights and a relaxed 
feasibility condition (partial coverage, which we describe below). For the basic combinatorial SET- 
COVER problem, our algorithm nevertheless makes a (small) contribution even in the one-pass 
case, with the simplicity of its logic as compared to Emek-Rosen: our logic, being a variant of the 
basic greedy approach, is arguably easier to implement and analyse. But most importantly, this 
algorithm sets the stage for our main result, which gets at the pass complexity of the problem. 

Result 2 (Main result, formalised as Theorem 3.8). In $p$ passes, approximating the optimum of a SET-COVER
instance to a factor smaller than $0.99\,n^{1/(p+1)}/(p+1)^2$ requires more than semi-streaming space.

This applies even to the decision problem of distinguishing a small optimum value from a large one. 

Results 1 and 2 together provide a near-complete understanding of the power of each additional
pass in improving the quality of an approximate solution to SET-COVER. Saha & Getoor had
posed the problem of obtaining this kind of tradeoff as an open question. Result 2 immediately
implies that obtaining an $O(\log n)$-approximation under semi-streaming space bounds requires
$\Omega(\log n/\log\log n)$ passes, almost matching the pass complexity of the Saha-Getoor algorithm
(or, for that matter, the algorithm behind our Result 1).
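
The implication for the number of passes can be spelled out as a short calculation. The sketch below assumes the lower-bound factor $0.99\,n^{1/(p+1)}/(p+1)^2$ from Result 2 and a target ratio of $c\log n$; the constants $c$ and $c'$ are placeholders introduced here for illustration.

```latex
\begin{align*}
0.99\,\frac{n^{1/(p+1)}}{(p+1)^2} \le c\log n
  &\implies \frac{\log n}{p+1} \le \log\log n + 2\log(p+1) + c' \\
  &\implies \frac{\log n}{p+1} \le 3\log\log n + c'
      \qquad \text{(using } p+1 \le \log n\text{)} \\
  &\implies p+1 \ge \frac{\log n}{3\log\log n + c'}
      = \Omega\!\left(\frac{\log n}{\log\log n}\right).
\end{align*}
```

If instead $p + 1 > \log n$, the pass bound holds trivially, so the conclusion is valid in both cases.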

In establishing Result 2, we invent a family of novel combinatorial structures that we call edifices.
To explain these, we first consider $p = 1$. In this case, an $\Omega(\sqrt{n})$ bound follows from a
reduction from the INDEX problem in communication complexity, via set systems based on affine
planes of finite order.^2 A "hard instance" for one pass consists of a family of sets of two different
sizes: one "large" set and many "medium" sets with very small pairwise intersections. The family
of lines in $\mathbb{F}^2$, where $\mathbb{F}$ is a finite field, gets us most of the way towards the desired properties.
To generalise this to $p > 1$, we reduce from the multi-party communication problem POINTER-JUMPING.
For this reduction, we need a more elaborate set system with sets of many different
sizes (similar to contribution thresholds in the multi-pass algorithms) and a tree-like incidence
structure, plus a small-intersection property as before. Very roughly, for $p = 2$, we start by considering
quadric surfaces inside $\mathbb{F}^3$, and then lines in $\mathbb{F}^2$ lifted onto these surfaces; for higher $p$,
we start with the appropriate extensions of these ideas to higher-degree algebraic varieties. These
varieties form a certain incidence geometry, which we call an edifice, that is a vast generalisation of
affine planes. Bounding the sizes of certain pairwise intersections between these varieties is the
most technical part of this work.

^2 Emek & Rosen also use affine planes, but differently, and obtain an $\Omega(n^{1/2-\delta})$ bound versus our $\sqrt{n}/(4+\delta)$.

Following Emek & Rosen, we also study partial set covers. In the PARTIAL-COVER$_{n,m,\varepsilon}$ problem,
an instance $I = (X, \mathcal{F}, \varepsilon)$ consists of $X$, $\mathcal{F}$, and a parameter $\varepsilon \in [0,1]$. We require a $(1 - \varepsilon)$-partial
cover of $X$: a collection $Sol \subseteq \mathcal{F}$ that covers at least $(1 - \varepsilon)|X|$ elements. A solution $Sol$ is
$\alpha$-approximate if $|Sol| \le \alpha|Opt|$, where $Opt$ is a minimum-cost total set cover for $(X, \mathcal{F})$.

Result 3 (Formalised as Theorems 4.1 and 4.4). The smallest $\alpha$ for which a semi-streaming algorithm
can compute an $\alpha$-approximate $(1 - \varepsilon)$-partial cover is in $\Theta_p(\min\{n^{1/(p+1)}, \varepsilon^{-1/p}\})$. The lower bound
applies to a decision problem of distinguishing a small total cover from a necessarily large partial cover.

The upper bound in Result 3 builds on the one-pass Emek-Rosen algorithm; thus we lose 
the extreme simplicity of the algorithm behind Result 1, but gain the ability to handle weighted 
instances. The main contribution is again the lower bound. It requires a reexamination of the 
edifices constructed for establishing Result 2 and proving that they satisfy additional geometric 
properties. These properties then allow us to build new edifices with different parameters that are 
suited to the problem at hand. This construction shows the power of the axiomatic approach we 
take in defining edifices. 

We note in passing the minor result (formalised as Theorem 3.9) that a tweak to Result 2 
gives a rounds/approximation tradeoff for a two-player communication version of SET-COVER 
à la Nisan [27] and Demaine et al. [10].

1.2 Related Work 

The quantification of savings afforded by extra streaming passes dates back to Munro & Paterson [25],
who studied pass/space tradeoffs for median-finding. This general topic remains current
[16, 8, 22, 7].

Efforts to understand the hardness of SET-COVER have led to many deep insights and connections
with various kinds of mathematics. Our technical contributions continue this tradition. In
the series of hardness-of-approximation results beginning with Lund & Yannakakis [21, 13, 24],
and recently culminating in Dinur & Steurer [11], each result required new insights into PCPs and
parallel repetition; for details, see the latter paper and the references therein. Closer to this
work, Nisan [27] initiated the study of SET-COVER as a (two-player) communication problem and
showed that, for every constant $\delta > 0$, computing a $(\frac{1}{2} - \delta)\log_2 n$-approximation to SET-COVER$_{n,m}$
requires $\Omega(m)$ randomised communication. His "hard instances" used $m$ exponentially large in $\sqrt{n}$. Nisan's
original motivation was combinatorial auctions, but his result can be interpreted in the data-streaming
setting as saying that a semi-streaming $(\frac{1}{2} - \delta)\log_2 n$-approximation is impossible,
regardless of the number of passes. Demaine et al. [10] showed that deterministic streaming algorithms
achieving a $\Theta(1)$-approximation require $\Omega(mn)$ space, thereby ruling out sublinear-space
solutions altogether.

All of the above lower bounds have, at their core, some variant of an old combinatorial construction:
namely, that of a set system with the so-called $r$-covering property [21]. Our own combinatorial
constructions (of edifices) play an analogous role in our lower bounds, but are quite
different at a technical level. In particular, they result in SET-COVER$_{n,m}$ instances where $m = n^{O(1)}$.
Their closest relative is the construction in Emek & Rosen [12] based on lines in an affine plane.

Turning to upper bounds, traditional (offline) approximation algorithms for SET-COVER are
discussed at length in Vazirani [30]; see also Slavik [29] and the references therein. Alon et al. [1]
studied SET-COVER in an online setting, focussing on competitive ratios rather than space considerations,
but under a fundamentally different input model: the sets are known in advance
and elements of the universe $X$ arrive in a stream. The setting we study was first considered by
Saha & Getoor [28], who called it "set streaming." They gave a 4-approximation algorithm for
MAX-k-COVERAGE, the problem of choosing $k$ sets from the stream so as to maximise the cardinality
of their union. Iterating this algorithm for $O(\log n)$ passes immediately gives an
$O(\log n)$-approximation for SET-COVER. Cormode, Karloff, and Wirth [9], targeting external-memory
efficiency, developed a "disk-friendly greedy" (DFG) algorithm for SET-COVER. In short, each step
of DFG adds some set whose contribution is at least $1/\beta$ times the maximum. As designed, DFG
yields an $O(\log_\beta n)$-pass, $(1 + \beta\ln n)$-approximate, $O(n\log n)$-space streaming algorithm.

The single-pass semi-streaming setting was first, and thoroughly, studied by Emek & Rosen [12].
Indeed, their results extend to PARTIAL-COVER, as well as item- and set-weighted variants. Their
algorithm, like ours, computes a certificate of coverage that indicates, for each item, which set (if
any) covers it: the implied solution $Sol$ covers a $1 - \varepsilon$ (weighted) fraction of $X$ and has
$w(Sol) = O(\min\{1/\varepsilon, \sqrt{n}\}\,w(Opt))$. On the lower bound side, they prove that for every $\varepsilon \ge 1/\sqrt{n}$, a
randomised semi-streaming algorithm that certifies an (unweighted) $\alpha$-approximate $(1 - \varepsilon)$-cover
must have $\alpha = \Omega(1/\varepsilon)$. Outputting only the sets in a solution (without a certificate) still requires
$\alpha = \Omega(\varepsilon^{-1}\log\log n/\log n)$. The still-weaker problem of approximating the optimum value requires
$\alpha = \Omega(n^{1/2-\delta})$ for every constant $\delta > 0$. Emek & Rosen remark [12, footnote 3] that they
can show this only for SET-COVER, and not for $(1 - \varepsilon)$-PARTIAL-COVER with $\varepsilon \gg 1/\sqrt{n}$. Compare
these lower bounds with our Results 2 and 3, specialised to $p = 1$.

The main result of Demaine et al. [10], whose deterministic lower bound we have discussed, 
is a randomised sublinear-space, though not semi-streaming, algorithm for SET-COVER. It achieves 
an $O(4^{1/\delta})$-pass, $O(4^{1/\delta}\rho)$-approximation in $\tilde{O}(mn^{\delta})$ space, where $\rho$ is the approximation ratio of
whatever offline SET-COVER algorithm we are prepared to run. 

2 A Simple Deterministic Multi-Pass Algorithm 

Model of computation. An instance of SET-COVER$_{n,m}$ consists of sets $S_1, \ldots, S_m \subseteq [n]$, specified
as a stream of tokens $(i, S_i)$, where $S_i$ is described in some reasonable way (either as a list of
its elements or as a characteristic vector) and $i$ is the ID of $S_i$. The IDs need not appear in the
order $1, \ldots, m$. The desired output is a set $Sol \subseteq [m]$ consisting of the IDs of sets that together
cover $[n]$, plus a certificate: an array $Coverer[1 \ldots n]$ in which, for each $x$, $Coverer[x]$ is the ID of a
set that covers $x$. Strictly speaking, $Sol$ is redundant because it can be computed from $Coverer$, but
keeping track of it explicitly aids exposition.

Recall that a semi-streaming algorithm is allowed $\tilde{O}(n) := O(n\,\mathrm{poly}\{\log n, \log m\})$ bits of
space. This clearly suffices to represent each of $Sol$ and $Coverer$, which need only $O(n\log m)$ bits,
under the sensible assumption that $|Sol| \le n$. An ideal semi-streaming algorithm for SET-COVER
would use no more space than this, asymptotically, and our Algorithms 1 and 2 achieve this space 
bound. 

2.1 Algorithm and Analysis 

As promised, we begin by giving a very simple deterministic $p$-pass, semi-streaming, "progressive
greedy" algorithm that returns a $(p+1)n^{1/(p+1)}$-approximation. The basic idea is that the
first pass is very conservatively greedy, picking a set into the solution iff its contribution is at least
some large number $\tau_1$ (i.e., it covers at least $\tau_1$ as-yet-uncovered elements); the second pass repeats
this logic with a threshold $\tau_2 < \tau_1$, making it slightly less conservative; and so on. Choosing suitable
thresholds gets us to a $p$-pass $pn^{1/p}$-approximation. This is the naive version of progressive
greedy. Our final algorithm "folds" the last two passes of this naive version into a single pass,
achieving the desired bound.


Algorithm 1 Naive version of "progressive greedy" algorithm for SET-COVER, in p passes

 1: procedure GREEDYPASS(stream σ, threshold τ, set Sol, array Coverer)
 2:     foreach (i, S) in σ do
 3:         C ← {x : Coverer[x] ≠ 0}          ▷ C is the set of already covered elements
 4:         if |S \ C| ≥ τ then
 5:             Sol ← Sol ∪ {i}
 6:             foreach x ∈ S \ C do Coverer[x] ← i
 7: procedure PROGGREEDYNAIVE(stream σ, integer n, integer p ≥ 1)
 8:     Coverer[1 … n] ← 0^n; Sol ← ∅
 9:     for j = 1 to p do GREEDYPASS(σ, n^(1−j/p), Sol, Coverer)
10:     output Sol, Coverer


For ease of reading we have not optimised the per-token processing time in our pseudocode. 
Clearly, in each pass, each set $S$ can be processed in $O(|S|)$ time in a RAM-style machine with
word size $\Omega(\log m)$.
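
The naive progressive greedy scheme can be sketched in Python as follows. This is an illustrative rendering, not the authors' code: the stream is assumed to be an in-memory list of `(set_id, elements)` tokens that can be re-read once per pass, IDs are assumed to be positive, and the threshold test $c \ge n^{1-j/p}$ is replaced by the equivalent integer test $c^p \ge n^{p-j}$ to avoid floating-point rounding (a choice of ours).

```python
from typing import Callable, List, Set, Tuple

Stream = List[Tuple[int, Set[int]]]  # tokens (set ID, set contents); IDs are positive


def greedy_pass(stream: Stream, meets_threshold: Callable[[int], bool],
                sol: Set[int], coverer: List[int]) -> None:
    """One pass: pick every set whose contribution meets the current threshold."""
    for set_id, elems in stream:
        new = [x for x in elems if coverer[x] == 0]  # as-yet-uncovered elements of this set
        if meets_threshold(len(new)):
            sol.add(set_id)
            for x in new:
                coverer[x] = set_id


def prog_greedy_naive(stream: Stream, n: int, p: int) -> Tuple[Set[int], List[int]]:
    """p passes with thresholds n^(1 - j/p) for j = 1..p; the last threshold equals 1."""
    coverer = [0] * (n + 1)  # coverer[x] for x in 1..n; index 0 is unused
    sol: Set[int] = set()
    for j in range(1, p + 1):
        # c >= n^(1 - j/p)  is equivalent to  c^p >= n^(p - j), tested exactly on integers.
        greedy_pass(stream, lambda c, j=j: c ** p >= n ** (p - j), sol, coverer)
    return sol, coverer
```

Because the final threshold is 1, every element that appears in some set ends the last pass with a nonzero entry in `coverer`.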

To analyse Algorithm 1, fix an arbitrary instance of SET-COVER$_{n,m}$. Each call to GreedyPass
makes a single pass through $\sigma$, considering every set $S$. Note that the contribution of $S$ in such a
pass is the quantity $|S \setminus C|$, computed in line 4, which is the number of new elements that $S$ covers.
Let $Opt \subseteq [m]$ be an optimum solution. For ease of exposition we will pretend that $Opt$ and $Sol$ are
collections of sets from the input instance (they are in fact collections of IDs of such sets). 

Definition 2.1. A $(\tau, \rho)$-bounded pass is a run of GreedyPass with threshold $\tau$ where, if $C_0$ is the
set of covered elements at the start of the pass, then for all $S$ in $\sigma$ we have $|S \setminus C_0| \le \rho\tau$.

Lemma 2.2. A $(\tau, \rho)$-bounded pass adds at most $\rho|Opt|$ sets to $Sol$.

Proof. Put $D = [n] \setminus C_0$. Each set in $Opt$ includes at most $\rho\tau$ of the elements in $D$, yet the sets in
$Opt$ together cover $D$. Therefore $|Opt| \ge |D|/(\rho\tau)$. Meanwhile, in this pass, each set added to $Sol$
covers at least $\tau$ previously uncovered elements of $D$, so the pass adds at most $|D|/\tau \le \rho|Opt|$ sets to $Sol$. □

Lemma 2.3. Algorithm 1 is a $p$-pass semi-streaming $pn^{1/p}$-approximation algorithm for SET-COVER$_{n,m}$.

Proof. The algorithm's correctness and $\tilde{O}(n)$ space bound are obvious, so we focus on the approximation
ratio. We claim that for each $j \in [p]$, the $j$th pass of Algorithm 1 is $(\tau_j, n^{1/p})$-bounded.

Let us prove this claim. Put $\tau_j = n^{1-j/p}$. For $j = 1$, the precondition required by Definition 2.1
is trivially satisfied. For larger $j$, consider an arbitrary set $S$ in $\sigma$ and let $C_0$ be as in Definition 2.1,
for the $j$th pass. If $S$ were added to $Sol$ in an earlier pass, then $|S \setminus C_0| = 0$. If not, then by the
logic of GreedyPass, set $S$'s contribution was less than $\tau_{j-1}$ during the $(j-1)$th pass. Since $C_0$
is a superset of the set of elements that had been covered when $S$ was processed in the $(j-1)$th
pass, we have $|S \setminus C_0| < \tau_{j-1} = n^{1/p}\tau_j$.

Having proved the claim, it follows from Lemma 2.2 that each pass adds at most $n^{1/p}|Opt|$ sets
to $Sol$. Therefore, in the end we have $|Sol| \le pn^{1/p}|Opt|$, as required. □

In fact, since the first pass adds at most $n^{1/p}$ sets, we have $|Sol| \le n^{1/p}(1 + (p-1)|Opt|)$.


Folding the last two passes. The final pass of Algorithm 1 picks a set merely for making a nonzero
contribution. When there are at least two passes, this final-pass logic can be "folded into" the
penultimate pass as follows. During the $p$th pass of a $(p+1)$-pass scheme, we run GreedyPass
as usual and additionally, in parallel, run a second instance of GreedyPass with threshold 1 that
builds an alternate solution $Alt$ (certified by a new array $Backup$, analogous to $Coverer$), starting
from $\emptyset$. Thus, $Alt$ is the solution that a 1-pass version of Algorithm 1 would have built. At the
end of the penultimate ($p$th) pass, $Sol$ might have left some elements of $X$ uncovered. We fix
this by post-processing: for each such element $x$, we add to $Sol$ the set in $Alt$ that covered $x$; this
information can be read from $Backup$. Algorithm 2 implements this very idea.


Algorithm 2 Progressive greedy algorithm for SET-COVER in p passes

1: procedure PROGGREEDY(stream σ, integer n, integer p ≥ 1)
2:     Coverer[1 … n], Backup[1 … n] ← 0^n; Sol, Alt ← ∅
3:     for j = 1 to p − 1 do GREEDYPASS(σ, n^(1−j/(p+1)), Sol, Coverer)
4:     in parallel, do GREEDYPASS(σ, n^(1/(p+1)), Sol, Coverer) and GREEDYPASS(σ, 1, Alt, Backup)
5:     for x = 1 to n do          ▷ Post-processing: elements not covered by Sol will get covered by sets from Alt
6:         if Coverer[x] = 0 then
7:             Sol ← Sol ∪ {Backup[x]}
8:             Coverer[x] ← Backup[x]
9:     output Sol, Coverer
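
The folding idea can be exercised with a small Python sketch that runs the two GreedyPass instances in lockstep during the last pass and then compares the result against a naive $(p+1)$-pass run. As before, this is an illustration under assumptions of ours: the stream is an in-memory list of `(set_id, elements)` tokens with positive IDs, the family covers $\{1, \ldots, n\}$, and thresholds of the form $n^{1-j/d}$ are tested exactly via $c^d \ge n^{d-j}$.

```python
from typing import Callable, List, Set, Tuple

Stream = List[Tuple[int, Set[int]]]


def step(set_id: int, elems: Set[int], meets: Callable[[int], bool],
         sol: Set[int], coverer: List[int]) -> None:
    """Process one token against one (sol, coverer) pair."""
    new = [x for x in elems if coverer[x] == 0]
    if meets(len(new)):
        sol.add(set_id)
        for x in new:
            coverer[x] = set_id


def greedy_pass(stream, meets, sol, coverer) -> None:
    for set_id, elems in stream:
        step(set_id, elems, meets, sol, coverer)


def prog_greedy_naive(stream: Stream, n: int, p: int):
    coverer, sol = [0] * (n + 1), set()
    for j in range(1, p + 1):
        greedy_pass(stream, lambda c, j=j: c ** p >= n ** (p - j), sol, coverer)
    return sol, coverer


def prog_greedy(stream: Stream, n: int, p: int):
    """Folded version: p passes, using the thresholds of a (p+1)-pass naive run."""
    d = p + 1
    coverer, backup = [0] * (n + 1), [0] * (n + 1)
    sol, alt = set(), set()
    for j in range(1, p):
        greedy_pass(stream, lambda c, j=j: c ** d >= n ** (d - j), sol, coverer)
    for set_id, elems in stream:  # pass p: two GreedyPass instances in lockstep
        step(set_id, elems, lambda c: c ** d >= n, sol, coverer)  # threshold n^(1/(p+1))
        step(set_id, elems, lambda c: c >= 1, alt, backup)        # threshold 1, from scratch
    for x in range(1, n + 1):     # post-processing from Backup
        if coverer[x] == 0:
            sol.add(backup[x])
            coverer[x] = backup[x]
    return sol, coverer
```

On any covering instance, `prog_greedy(stream, n, p)` and `prog_greedy_naive(stream, n, p + 1)` should return the same pair, which is what the lemma that follows asserts.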


Lemma 2.4. For every stream $\sigma$, the output of PROGGREEDY($\sigma$, $p$) in Algorithm 2 is identical to that of
PROGGREEDYNAIVE($\sigma$, $p + 1$) in Algorithm 1.

Proof. Fix an input stream $\sigma$. Let $A_1$ and $A_2$ denote, respectively, the invocation of Algorithm 1 as
PROGGREEDYNAIVE($\sigma$, $p + 1$) and the invocation of Algorithm 2 as PROGGREEDY($\sigma$, $p$). Let $Sol_1$
be the value of $Sol$ after $p$ passes of $A_1$. It is immediate that $Sol_1$ is also the value of $Sol$ in $A_2$ just
before the post-processing loop in lines 5 to 8.

Let $Cov_1$ and $Cov_2$ denote, respectively, the final output values of the array $Coverer$ in $A_1$ and $A_2$.
Let $C = \bigcup_{S \in Sol_1} S$. Our above observation says that $Cov_1[x] = Cov_2[x]$ for all $x \in C$. It remains
to prove that the same equality also holds for all $x \in [n] \setminus C$. But this, too, is immediate from the
observation that for each $x \in [n] \setminus C$, each of $Cov_1[x]$ and $Backup[x]$, and thus $Cov_2[x]$ as well, is set
to the earliest set in $\sigma$ that contains $x$. □

Theorem 2.5. There is a $p$-pass, $O(n\log m)$-space algorithm that, for every instance of SET-COVER$_{n,m}$,
outputs a feasible solution $Sol$ with $|Sol| \le n^{1/(p+1)}(1 + p|Opt|) \le (p+1)n^{1/(p+1)}|Opt|$.

Proof. This follows immediately by combining Lemma 2.3 with Lemma 2.4. □ 

Folding three passes? It is natural to wonder whether the above "folding" idea can be taken 
further, achieving an even better pass/approximation tradeoff. As it turns out, we cannot fold (the 
last) three passes into one. The most convincing proof is the lower bound that we shall establish 
in Section 3. 

As designed, the algorithm cannot be sure what the contribution of a set will be in a particular
pass until it actually sees this set in that pass. In the last pass, however, we need only know that
the contribution is non-zero: after the penultimate pass, if $Coverer[x] = 0$ and $Backup[x] = i$, we
know "in advance" that set $S_i$ has non-zero contribution.

2.2 Tightness of Analysis

The lower bound in Section 3 shows that the approximation ratio guaranteed by Theorem 2.5 is 
asymptotically optimal for $p = O(1)$ passes. But if $p$ is allowed to grow with $n$, then there remains
a small $O(p^3)$ discrepancy between that upper bound and the lower bound we shall eventually
prove in Theorem 3.8. We can, however, prove that our analysis of the approximation guarantee 
of Algorithm 1 is tight. 

Theorem 2.6. For each integer $p \ge 2$ and $q$ large enough, there is an instance $I_{n,m}$ of SET-COVER$_{n,m}$, with
$n = q^p - 1$ and $m \le pq$, such that $I_{n,m}$ admits a set cover of size 1, whereas Algorithm 1, using $p$ passes
and running on $I_{n,m}$, returns a solution with $p(q-1) \approx pn^{1/p}$ sets.

Proof. Put $X = [q]^p \setminus \{(q, q, \ldots, q)\}$. For each $j \in [p]$ and $y \in [q-1]$, define the sets

    $X_j = \{(x_1, \ldots, x_p) \in X : x_1 = \cdots = x_{j-1} = q\}$,

    $S_j^y = \{(x_1, \ldots, x_p) \in X : x_1 = \cdots = x_{j-1} = q \wedge x_j = y\}$.

Then $X = X_1 \supseteq \cdots \supseteq X_p$ and $S_j^y \subseteq X_j$. Observe that these sets $S_j^y$ are pairwise disjoint and
partition $X$. Further, $|S_j^y| = q^{p-j}$ and $|X_j| = q^{p-j+1} - 1$ for all $j, y$.

Let $\sigma_j$ be the stream consisting of the sets $\{S_j^y : y \in [q-1]\}$ in some arbitrary order. Let $\sigma$ be the
stream consisting of $\sigma_p$ followed by $\sigma_{p-1}$ and so on, down to $\sigma_1$, and finally the set $X$. Consider the
SET-COVER$_{n,m}$ instance $I_{n,m}$ defined by $\sigma$: it satisfies $n = |X| = q^p - 1$ and $m = p(q-1) + 1 \le pq$,
as claimed. Since the entire universe $X$ occurs as a set in $I_{n,m}$, the optimum set cover consists of
just that one set.

Now consider the behaviour of Algorithm 1 on $\sigma$. For each $j \in [p]$, let $\tau_j = n^{1-j/p} = (q^p - 1)^{(p-j)/p}$ be the
threshold in the $j$th pass. We claim that

    $q^{p-j} - 1 < \tau_j \le q^{p-j}$.    (1)

The second inequality in (1) is easy to see: $\tau_j = (q^p - 1)^{(p-j)/p} \le (q^p)^{(p-j)/p} = q^{p-j}$. The first inequality is
obvious when $j = p$, so suppose that $1 \le j < p$. Consider the function

    $G_{j,p}(x) = (x^p - 1)^{(p-j)/p} - (x^{p-j} - 1)$.

A routine calculation shows that the derivative $G'_{j,p}(x) = (p-j)x^{p-1}[(x^p - 1)^{-j/p} - x^{-j}]$. For
$x > 1$, we have $(x^p - 1)^{1/p} < x$, so $(x^p - 1)^{-j/p} > x^{-j}$, therefore $G'_{j,p}(x) > 0$. Since $G_{j,p}(1) = 0$,
we now conclude that $G_{j,p}(q) > 0$, which gives us the first inequality in (1) and proves the claim.
We can now see that the $j$th pass satisfies the following properties.

1. At the start of the $j$th pass, the set of uncovered elements is precisely $X_j$.

2. Each set in $\sigma_p, \ldots, \sigma_{j+1}$ makes a contribution equal to its cardinality. Therefore the largest
such contribution is $q^{p-j-1} < q^{p-j} - 1 < \tau_j$, by (1).

3. Each set in $\sigma_j$ makes a contribution equal to its cardinality, which is $q^{p-j} \ge \tau_j$, by (1).

4. Each set in $\sigma_{j-1}, \ldots, \sigma_1$ makes a contribution of zero.

5. The set $X$, which arrives at the end of $\sigma$, makes a contribution of $q^{p-j} - 1 < \tau_j$, by (1).

6. Therefore the sets added to $Sol$ during the $j$th pass are exactly the sets in $\sigma_j$.

The validity of these properties can be formally proved by backward induction on $j$. The details
are routine and tedious, so we omit them.

Based on the above properties, we see that Algorithm 1 produces a solution consisting of all
sets in all substreams $\sigma_j$. The number of sets in this solution is $p(q-1)$, as claimed. □
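
The tightness construction is easy to check by machine for small parameters. The following Python sketch builds the stream from the proof (the substreams for $j = p$ down to $j = 1$, followed by the whole universe) and runs a compact version of the naive algorithm on it. The relabelling of points of $[q]^p$ as integers $1, \ldots, n$, the function names, and the exact integer threshold test $c^p \ge n^{p-j}$ are choices made here for illustration.

```python
from itertools import product
from typing import List, Set, Tuple

Stream = List[Tuple[int, Set[int]]]


def hard_instance(q: int, p: int) -> Tuple[Stream, int]:
    """Stream sigma_p, ..., sigma_1, X over the universe [q]^p minus the all-q point."""
    points = [x for x in product(range(1, q + 1), repeat=p) if x != (q,) * p]
    label = {x: i + 1 for i, x in enumerate(points)}  # relabel points as 1..n
    stream: Stream = []
    next_id = 1
    for j in range(p, 0, -1):        # substream sigma_j holds the sets S_j^y for y in [q-1]
        for y in range(1, q):
            members = {label[x] for x in points
                       if all(c == q for c in x[:j - 1]) and x[j - 1] == y}
            stream.append((next_id, members))
            next_id += 1
    stream.append((next_id, set(label.values())))     # finally, the whole universe X
    return stream, len(points)


def prog_greedy_naive(stream: Stream, n: int, p: int) -> Set[int]:
    """Naive progressive greedy; returns only the set of chosen IDs."""
    coverer = [0] * (n + 1)
    sol: Set[int] = set()
    for j in range(1, p + 1):
        for set_id, elems in stream:
            new = [x for x in elems if coverer[x] == 0]
            if len(new) ** p >= n ** (p - j):          # exact test for |S \ C| >= n^(1 - j/p)
                sol.add(set_id)
                for x in new:
                    coverer[x] = set_id
    return sol
```

For instance, with $q = 3$ and $p = 3$ the instance has $n = 26$ elements and $m = 7$ sets; the algorithm selects the six sets $S_j^y$ and never selects the single set $X$ that covers everything.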

3 The Basic Lower Bound 

In this section we establish our main result, which gives a strong lower bound on the best approximation
ratio achievable by a semi-streaming algorithm for SET-COVER. Our lower bound gives the
optimal dependence of this ratio on $n$. Moreover, for $p$ passes, our lower bound is only about $p^3$
times smaller than our upper bound in Theorem 2.5. In particular, when $p = O(1)$, the lower
bound is asymptotically optimal.

3.1 Warm Up: One-Pass Algorithms 

Our proof is based on a fairly technical combinatorial construction. To motivate it, let us first 
outline a simple proof of a one-pass lower bound. We start with the well-known index (or IDX) 
problem in communication complexity, where Alice must send Bob a (possibly random) message 
about her $n$-bit string $x$, so that Bob, who holds an index $h \in [n]$, can output $x_h$ (the $h$th bit of $x$)
with high probability. A textbook result [20] is that this requires Alice to send $\Omega(n)$ bits. To reduce
IDX to SET-COVER, we construct a universe $X$ and a family of $n$ distinct sets $S_1, \ldots, S_n \subseteq X$. Alice
encodes $x$ as the stream of sets $\{S_i : x_i = 1\}$, and Bob encodes $h$ as a "stream" of just one set:
$X \setminus S_h$. Alice's stream followed by Bob's is an instance of SET-COVER.

When $x_h = 1$, this instance clearly has $|Opt| = 2$. We can force $|Opt|$ to be much larger when
$x_h = 0$ if we make each $|S_i|$ large and each $|S_i \cap S_j|$ small (for $i \ne j$): since Alice's stream is
missing $S_h$, it will take "many" sets $S_i$, $i \ne h$, to cover the elements of $S_h$.

Incidence geometry gives us an elegant construction of a collection $\{S_1, \ldots, S_n\}$ with these
properties. Consider the lines of an affine plane of order $q$, with $q$ a prime power. More explicitly,
let $\mathbb{F}_q$ denote the finite field with $q$ elements, $X = \mathbb{F}_q^2$, $n = |X| = q^2$, and $\{S_1, \ldots, S_n\}$ be some
collection of $n$ distinct lines out of the $q^2 + q$ such lines in $\mathbb{F}_q^2$. Then each $|S_i| = q$ and each
$|S_i \cap S_j| \le 1$, for $i \ne j$. In particular, $x_h = 0$ now implies that $|Opt| \ge q = \sqrt{n}$. Therefore
approximating such a SET-COVER instance to a factor smaller than $\sqrt{n}/2$ is enough to solve IDX,
whence an algorithm achieving such approximation must use $\Omega(n)$ space.
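
The affine-plane set system and the reduction from IDX can be illustrated with a short Python sketch. It assumes $q$ is prime, so that arithmetic modulo $q$ realises the field (a general prime power would need a proper finite-field implementation); the function names, 0-based indexing of lines, and the ID given to Bob's set are conventions chosen here, not taken from the paper.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence, Set, Tuple

Point = Tuple[int, int]


def affine_lines(q: int) -> List[FrozenSet[Point]]:
    """All q^2 + q lines of the affine plane over Z_q (q assumed prime)."""
    lines = []
    for a in range(q):                       # non-vertical lines y = a*x + b
        for b in range(q):
            lines.append(frozenset((x, (a * x + b) % q) for x in range(q)))
    for c in range(q):                       # vertical lines x = c
        lines.append(frozenset((c, y) for y in range(q)))
    return lines


def index_to_set_cover(lines: Sequence[FrozenSet[Point]], bits: Sequence[int],
                       h: int) -> List[Tuple[int, Set[Point]]]:
    """Alice streams the lines S_i with bits[i] = 1; Bob appends the universe minus S_h."""
    universe = set().union(*lines)
    stream = [(i + 1, set(lines[i])) for i, b in enumerate(bits) if b]
    stream.append((len(bits) + 1, universe - lines[h]))
    return stream


def max_pairwise_intersection(lines: Sequence[FrozenSet[Point]]) -> int:
    return max(len(a & b) for a, b in combinations(lines, 2))
```

With `q = 5` this produces 30 lines of 5 points each, any two of which meet in at most one point, so covering a missing line $S_h$ with other lines needs at least $q$ of them.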

To rule out a semi-streaming algorithm we must prove a stronger space lower bound, one that exceeds $\tilde{O}(n)$.
A simple tweak achieves this: sticking with the universe $\mathbb{F}_q^2$, replace the lines in the above construction
with degree-2 algebraic curves, say. This preserves the essential dichotomy between
large $|S_i|$ and small $|S_i \cap S_j|$ while allowing us to reduce from an IDX instance on many more bits.

The one-pass lower bound proof we have just outlined is arguably more straightforward than
the Emek-Rosen proof [12]. Though both proofs begin with the affine plane, ours builds an explicit
set system, rather than relying on a probabilistic argument, and reduces directly from IDX, rather
than employing bespoke entropy calculations, leading to a more modular proof. But there is
a far more important takeaway from our proof: the observation that employing higher-degree
curves adds great flexibility to the construction. Exploiting this observation to its fullest allows
us to handle multi-pass algorithms by greatly generalising the construction, moving from affine
planes to more abstract incidence geometries that we call edifices (Definition 3.3 below). Edifices,
like affine planes, are examples of Buekenhout geometries [6].

3.2 Multi-Player Tree Pointer Jumping

A popular source problem for multi-pass streaming lower bounds is the communication problem multi-player tree pointer jumping, which generalises IDX. Let $T$ be a rooted tree with $k \ge 2$ layers of vertices, where a vertex is in layer $\ell$ if it is at distance exactly $k - \ell$ from the root (thus, the root is in layer $k$) and every leaf is in layer 1. The pointer jumping problem on $T$, denoted $\mathrm{MPJ}_T$, is a $k$-player number-in-hand communication game involving players named $\mathrm{PLR}_1, \ldots, \mathrm{PLR}_k$. For $1 < j \le k$, $\mathrm{PLR}_j$'s input specifies one pointer (i.e., out-edge) at each vertex in layer $j$; by definition each such pointer leads to a vertex in layer $j - 1$. Furthermore, $\mathrm{PLR}_1$'s input specifies a bit at each layer-1 vertex; these bits are called leaf bits. Given such an input, $\pi$, let $T|_\pi$ denote the subgraph of $T$ defined by retaining only those edges of $T$ that correspond to pointers in $\pi$. Then $T|_\pi$ contains a unique root-to-leaf path, ending at a leaf $v_\pi$ say. The desired output corresponding to $\pi$, denoted $\mathrm{MPJ}_T(\pi)$, is defined to be the leaf bit at $v_\pi$.
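
The definition above can be made concrete with a minimal sketch that evaluates the pointer jumping function on a small tree. The vertex naming (a tuple of child indices read from the root) and the function name `eval_mpj` are illustrative assumptions, not notation from the paper.

```python
from typing import Dict, Tuple

# Assumption: a vertex is named by the child indices on the path from the root,
# so the root is the empty tuple and a layer-1 vertex has k - 1 entries.
Vertex = Tuple[int, ...]


def eval_mpj(k: int, pointers: Dict[Vertex, int], leaf_bits: Dict[Vertex, int]) -> int:
    """Follow the pointers from the root (layer k) down to layer 1; return the leaf bit."""
    v: Vertex = ()               # the root, in layer k
    for _ in range(k - 1):       # k - 1 pointer jumps reach layer 1
        v = v + (pointers[v],)   # pointers[v] is the child index chosen at vertex v
    return leaf_bits[v]


# Tiny instance: complete binary tree (t = 2) with k = 3 layers.
pointers = {(): 1, (0,): 0, (1,): 0}
leaf_bits = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
result = eval_mpj(3, pointers, leaf_bits)
```

In the game itself no single player sees all of `pointers` and `leaf_bits`; the sketch only fixes what the desired output is.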

The communication game involves players announcing messages on a shared broadcast channel, according to a public-coin randomised protocol. The protocol proceeds in rounds, where a round is defined as one message each from $\mathrm{PLR}_1, \ldots, \mathrm{PLR}_k$, speaking in that order. The last message of the protocol must be a single bit, which is defined to be the protocol's output. An $[r, C, \varepsilon]$-protocol for $\mathrm{MPJ}_T$ is defined to be one in which

• there are at most $r$ rounds of communication;

• within each round, the total number of bits communicated is at most $C$; and

• the protocol's output equals $\mathrm{MPJ}_T(\pi)$ with probability at least $1 - \varepsilon$.

Definition 3.1. The $r$-round randomised communication complexity of $\mathrm{MPJ}_T$ is defined to be
$R^r(\mathrm{MPJ}_T) := \min\{C : \text{there exists an } [r, C, \tfrac{1}{3}]\text{-protocol for } \mathrm{MPJ}_T\}$.

Intuitively, if players trying to solve $\mathrm{MPJ}_T$ are restricted to a "small" amount of communication
per round, then because they are forced to speak in the "wrong" order, in the first round the
only player who is able to convey "useful" information is $\mathrm{PLR}_k$, in the second round the only
such player is $\mathrm{PLR}_{k-1}$, and so on. Therefore, if the protocol is further restricted to $k - 1$ rounds,
$\mathrm{PLR}_1$ rarely gets a chance to convey useful information and so the protocol's error probability
should be high. This intuition was formalised in the round elimination ideas of Miltersen et al. [23].
Using these ideas and a direct sum argument, Chakrabarti, Cormode, and McGregor [8] proved 
a distributional communication complexity lower bound for MPJ. We only need the consequent 
randomised communication complexity bound, stated below. 

Theorem 3.2 ([8, Theorem 4.5]). Let $T$ be a complete $t$-ary tree with $k \ge 2$ layers of vertices. Then
$R^{k-1}(\mathrm{MPJ}_T) = \Omega(t/k^2)$. □

3.3 Reduction to Set Cover via Edifices 

Definition 3.3. A $(k, d, q, t)$-edifice $T$ over a universe $X$ is a rooted tree, together with an associated
collection of sets called the varieties of the edifice, satisfying the following properties.

(E1) $T$ is a complete $t$-ary tree, i.e., every non-leaf vertex has exactly $t$ children.

(E2) $T$ has $k$ levels (equivalently, depth $k - 1$), numbered 1 through $k$ from leaves to root.

(E3) Each vertex $v$ of $T$ has an associated set $X_v \subseteq X$, called the variety at $v$.

(E4) If $u$ is the parent of $v$, then $X_u \supseteq X_v$. If $r$ is the root of $T$, then $X_r = X$.

(E5) If $z$ is a leaf of $T$, then $|X_z| \ge q$.

(E6) For each leaf $z$ of $T$ and each vertex $v$ not an ancestor of $z$, we have $|X_z \cap X_v| \le d + k - 1$.

The $(k,d,q,t)$-edifices that interest us will have $k \le d \ll q \ll t$. In particular, if $d + k \le q$
and $t \ge 2$, it is easy to prove from (E4), (E5), and (E6) that varieties at distinct vertices are distinct
(as sets). For readers familiar with incidence geometry, we note that these varieties then form
a Buekenhout geometry [6] of rank $k$, where the type map sends each variety to the level of its
corresponding vertex and the incidence relation is symmetrised set inclusion. Thus our notion of
an edifice generalises affine planes, which we used in our warm-up proof: an affine plane over
$\mathbb{F}_q$ is a $(2, 0, q, q^2 + q)$-edifice over the universe $\mathbb{F}_q^2$.
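
As a sanity check on the affine-plane example, the following sketch enumerates the lines of the affine plane over $\mathbb{Z}_q$ for a prime $q$ and measures the quantities that (E5) and (E6) constrain. Restricting to prime $q$ (so that arithmetic modulo $q$ is field arithmetic) and the helper name `affine_lines` are assumptions made for illustration.

```python
from itertools import combinations


def affine_lines(q):
    """All lines of the affine plane over Z_q (q prime), each as a frozenset of points."""
    lines = []
    for m in range(q):                # non-vertical lines y = m*x + c
        for c in range(q):
            lines.append(frozenset((x, (m * x + c) % q) for x in range(q)))
    for a in range(q):                # vertical lines x = a
        lines.append(frozenset((a, y) for y in range(q)))
    return lines


q = 5
lines = affine_lines(q)
num_lines = len(lines)                                    # plays the role of t
min_size = min(len(line) for line in lines)               # (E5): every leaf variety has q points
max_overlap = max(len(a & b) for a, b in combinations(lines, 2))  # (E6): at most d + k - 1 = 1
```

With $k = 2$ and $d = 0$ the tree is just a root with one leaf per line, so the only non-ancestors of a leaf are the other leaves, which is why pairwise intersections suffice here.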

Theorem 3.4. Suppose there exists a $(p + 1, d, q, t)$-edifice $T$ with $q > (p + d)(p + 1)$. Then every
randomised $p$-pass streaming algorithm that, with probability at least 2/3, approximates SET-COVER to a
factor smaller than $q/((p + d)(p + 1))$ must use at least $R^p(\mathrm{MPJ}_T)/p = \Omega(t/p^3)$ bits of space.

Proof. Let the edifice $T$ be over a universe $X$. We shall transform an input $\pi$ to $\mathrm{MPJ}_T$ into an
instance $\mathcal{I}(\pi)$ of SET-COVER on the universe $X$, with each set in $\mathcal{I}(\pi)$ being assigned to one of
$\mathrm{PLR}_1, \ldots, \mathrm{PLR}_{p+1}$.

The transformation is as follows. Let $u$ be a vertex of $T$ in layer $j \ge 2$. Then $\pi$ specifies a
pointer from $u$ to some vertex, say $v$. We encode this pointer as the set $X_u \setminus X_v$ and assign this
set to $\mathrm{PLR}_j$. We perform this encoding for each vertex in layers 2 and higher. Furthermore, we
encode the leaf bits of $\pi$ as the collection of sets $\{X_z : \pi \text{ specifies a '1' at leaf } z\}$ and assign all sets
in this collection to $\mathrm{PLR}_1$. Finally, we assign every singleton subset of $X$ to $\mathrm{PLR}_1$. This completes the
specification of our SET-COVER instance, which is valid thanks to the inclusion of the singletons.

Let $v_{p+1}, \ldots, v_1$ be the unique root-to-leaf path in $T|_\pi$, with $v_j$ being in layer $j$, for each $j$. Put $X_j = X_{v_j}$, for each $j$. By (E4), $X = X_{p+1} \supseteq \cdots \supseteq X_1$, so the encodings of the pointers at $v_{p+1}, \ldots, v_2$ together cover $\bigcup_{j=2}^{p+1} (X_j \setminus X_{j-1}) = X \setminus X_1$. Now suppose that $\mathrm{MPJ}_T(\pi) = 1$. Then the encoding of the leaf bits includes $X_1$, so $\mathcal{I}(\pi)$ has a set cover of size $Q_1 := p + 1$.

Next, suppose that $\mathrm{MPJ}_T(\pi) = 0$. A set cover must, in particular, cover $X_1$. However, the
encodings of the pointers at $v_{p+1}, \ldots, v_2$ are all disjoint from $X_1$ and the encoding of the leaf bits
does not include $X_1$. Therefore, $X_1$ must be covered using only singletons and sets corresponding
to non-ancestors of $v_1$. For each such non-ancestor, $y$, the corresponding set in $\mathcal{I}(\pi)$ is a subset of
the variety $X_y$. By (E6), such a set covers at most $d + p$ elements of $X_1$ whereas, by (E5), $|X_1| \ge q$.
Therefore every set cover in $\mathcal{I}(\pi)$ uses at least $Q_0 := q/(d + p)$ sets.

It follows that approximating even the optimum value of $\mathcal{I}(\pi)$ to a factor smaller than $Q_0/Q_1 = q/((p + d)(p + 1))$ is sufficient to determine $\mathrm{MPJ}_T(\pi)$.

Let $A$ be a $p$-pass $\tfrac{1}{3}$-error randomised streaming algorithm that approximates SET-COVER this
well, using at most $s$ bits of space. The players can solve $\mathrm{MPJ}_T$ as follows. On input $\pi$, each player
follows the above encoding scheme so that players jointly arrive at the SET-COVER instance $\mathcal{I}(\pi)$,
with sets assigned amongst the players. They simulate the execution of $A$ on the stream $\sigma$ obtained
by taking $\mathrm{PLR}_1$'s sets, followed by $\mathrm{PLR}_2$'s sets, and so on. Each time the execution of $A$ moves off
one player's portion of $\sigma$, that player broadcasts the memory contents of $A$. This simulation uses
one communication round per streaming pass, and spends $sp$ bits of communication per round.
Therefore it yields a $[p, sp, \tfrac{1}{3}]$-protocol for $\mathrm{MPJ}_T$, whence $sp \ge R^p(\mathrm{MPJ}_T)$. □
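
The gap created by the reduction can be observed on the smallest case, $p = 1$, using the affine plane over $\mathbb{Z}_q$ ($q$ prime) as a $(2, 0, q, q^2 + q)$-edifice. The sketch below is an illustration under those assumptions; the helper names `encode` and `cover_lower_bound` are hypothetical, and the lower bound it computes is the simple counting bound from the proof rather than an exact optimum.

```python
def affine_lines(q):
    """Leaf varieties: all q*q + q lines of the affine plane over Z_q (q prime)."""
    slanted = [frozenset((x, (m * x + c) % q) for x in range(q))
               for m in range(q) for c in range(q)]
    vertical = [frozenset((a, y) for y in range(q)) for a in range(q)]
    return slanted + vertical


def encode(q, root_pointer, leaf_bits):
    """Return (sets of PLR_1, sets of PLR_2) for the two-layer case p = 1."""
    lines = affine_lines(q)
    universe = frozenset((x, y) for x in range(q) for y in range(q))
    plr2 = [universe - lines[root_pointer]]                      # encodes the root's pointer
    plr1 = [lines[z] for z, bit in enumerate(leaf_bits) if bit]  # leaf varieties with bit 1
    plr1 += [frozenset([pt]) for pt in sorted(universe)]         # singletons keep the instance feasible
    return plr1, plr2


def cover_lower_bound(target, sets):
    """Any cover of `target` needs at least |target| / (most elements one set covers) sets."""
    best = max(len(s & target) for s in sets)
    return -(-len(target) // best)   # ceiling division


q = 5
lines = affine_lines(q)
universe = frozenset((x, y) for x in range(q) for y in range(q))

# Case MPJ = 1: the pointed-to leaf carries bit 1, so p + 1 = 2 sets suffice.
plr1, plr2 = encode(q, 7, [1] * len(lines))
two_set_cover = lines[7] in plr1 and (plr2[0] | lines[7]) == universe

# Case MPJ = 0: the pointed-to leaf carries bit 0, so X_1 must be covered piecemeal.
bits = [1] * len(lines)
bits[7] = 0
plr1, plr2 = encode(q, 7, bits)
lb_when_zero = cover_lower_bound(lines[7], plr1 + plr2)
```

Here the two cases are separated by a factor of at least $q/2$, matching $Q_0/Q_1 = q/((p + d)(p + 1))$ with $p = 1$ and $d = 0$.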

3.4 Construction of an Edifice 

Theorem 3.5. Let $k$, $d$, and $q$ be integers with $k \ge 1$, $d \ge 0$, and $q \ge d + k$, with $q$ being a prime power.
Then there exists a $(k, d, q, q^{d+k}(1 - 1/q))$-edifice.
Proof. We shall construct an explicit edifice over the universe $X = \mathbb{F}_q^k$. The varieties of our edifice
will be certain well-structured varieties in the sense of algebraic geometry, i.e., solution sets of
polynomial equations. Write the coordinates of a generic point in $\mathbb{F}_q^k$ as $(x, y_1, \ldots, y_{k-1})$. An edificial
equation of rank $i$ is defined to be an equation of the form

$$y_i = \ell_i(y_1, \ldots, y_{i-1}, f_{k-i}(x)), \tag{2}$$

where $\ell_i(z_1, \ldots, z_i)$ is a homogeneous linear form over $\mathbb{F}_q$ whose $z_i$-coefficient is nonzero and $f_j(x)$
is a monic polynomial in $\mathbb{F}_q[x]$ of degree exactly $d + j$. Equation (2) is abbreviated as $[\![\ell_i : f_{k-i}]\!]$.

Notice that irrespective of the value of $i$ there are exactly $d + k$ coefficients appearing on the
right-hand side of eq. (2), one of which must be nonzero. There are exactly $t := q^{d+k}(1 - 1/q)$
ways to choose these coefficients, leading to exactly $t$ distinct edificial equations of each rank.

Let $T$ be a rooted complete $t$-ary tree with $k$ levels, the root $r$ being at level $k$. For $1 \le i \le k - 1$,
for each level-$(i + 1)$ vertex $v$ of $T$, label each of the $t$ edges leaving $v$ with one of the $t$ distinct
rank-$i$ edificial equations. Associate a variety $X_v$ with vertex $v$ as follows. Let $X_r = X$. If $v \ne r$,
let $X_v$ be the variety defined by the set of edificial equations labelling the edges on the path from $r$
to $v$. We shall show that $T$, with these associated varieties, forms a $(k,d,q,t)$-edifice. Certainly,
properties (E1), (E2), (E3), and (E4) are immediate. The following observation will be helpful in
establishing the remaining properties.

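Observation 3.6. Suppose $\mathbf{x} = (x, y_1, \ldots, y_{k-1})$ satisfies the edificial equations $[\![\ell_1 : f_{k-1}]\!], \ldots, [\![\ell_j : f_{k-j}]\!]$
for some $j$ with $1 \le j \le k - 1$. Then there exist linear forms $\lambda_i(z_1, \ldots, z_i)$ over $\mathbb{F}_q$ such that

$$y_i = \lambda_i(f_{k-1}(x), \ldots, f_{k-i}(x)), \qquad 1 \le i \le j. \tag{3}$$

Therefore each of $y_1, \ldots, y_j$ is determined by $x$.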

For the rest of this proof let $z$ be a leaf; let $\mathbf{x} = (x, y_1, \ldots, y_{k-1}) \in X_z$ be an arbitrary point in the
variety at $z$ and let $[\![\ell_1 : f_{k-1}]\!], \ldots, [\![\ell_{k-1} : f_1]\!]$ be the edificial equations defining $X_z$. We record the
following corollary of Observation 3.6.

Observation 3.7. The point $\mathbf{x}$ is completely determined by its first coordinate $x$.

It follows that for each $a \in \mathbb{F}_q$, $X_z$ contains exactly one such point $\mathbf{x}$ with $x = a$, whence
$|X_z| = |\mathbb{F}_q| = q$. This establishes property (E5).

Property (E6) requires a more careful examination of the form of the edificial equations. Consider a vertex $v$ that is not an ancestor of the leaf $z$. Let $u$ be the highest (by level) ancestor of $v$
that is still not an ancestor of $z$. Since $X_u \supseteq X_v$, it suffices to prove that $|X_z \cap X_u| \le d + k - 1$. Suppose $u$ is at level $j < k$. Then $X_u$ is defined by the $k - j - 1$ highest-ranked edificial equations that
define $X_z$ (which are of ranks $k - 1$ through $j + 1$) plus an additional rank-$j$ equation $[\![\ell_j^\dagger : f_{k-j}^\dagger]\!]$,
where either $\ell_j \ne \ell_j^\dagger$ or $f_{k-j} \ne f_{k-j}^\dagger$ or both.

Suppose that $\ell_j = \ell_j^\dagger$, so that $f_{k-j} \ne f_{k-j}^\dagger$. Each point $\mathbf{x} = (x, y_1, \ldots, y_{k-1}) \in X_z \cap X_u$ must, in
particular, satisfy $[\![\ell_j : f_{k-j}]\!]$ and $[\![\ell_j : f_{k-j}^\dagger]\!]$. Comparing these two equations gives

$$\ell_j(y_1, \ldots, y_{j-1}, f_{k-j}(x)) = y_j = \ell_j(y_1, \ldots, y_{j-1}, f_{k-j}^\dagger(x)) \tag{4}$$

$$\Longrightarrow\ \ell_j(0, \ldots, 0, f_{k-j}(x) - f_{k-j}^\dagger(x)) = 0$$

$$\Longrightarrow\ f_{k-j}(x) - f_{k-j}^\dagger(x) = 0, \tag{5}$$

because the linear form $\ell_j(z_1, \ldots, z_j)$ is required to have a nonzero $z_j$-coefficient. The left-hand
side of eq. (5) is a nonzero univariate polynomial of degree at most $d + k - j$, whence it has at most
$d + k - j$ roots in $\mathbb{F}_q$. By Observation 3.7, it follows that $|X_z \cap X_u| \le d + k - j \le d + k - 1$.
Finally, suppose $\ell_j \ne \ell_j^\dagger$. We now make the crucial observation that

$$f_1(x), \ldots, f_{k-1}(x) \text{ are linearly independent over } \mathbb{F}_q, \tag{6}$$

which holds because these polynomials have distinct degrees. With this in mind, examining
eqs. (2) and (3) and recalling that $\ell_i(z_1, \ldots, z_i)$ has a nonzero $z_i$-coefficient, we see that $\lambda_i(z_1, \ldots, z_i)$
also has a nonzero $z_i$-coefficient. Therefore, for each $i \in \{1, \ldots, k - 1\}$, the collection of polynomials $\{\lambda_1(f_{k-1}(x)), \ldots, \lambda_i(f_{k-1}(x), \ldots, f_{k-i}(x))\}$ is a basis for the linear subspace of $\mathbb{F}_q[x]$ spanned
by $\{f_{k-1}(x), \ldots, f_{k-i}(x)\}$.

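Suppose $\mathbf{x} = (x, y_1, \ldots, y_{k-1}) \in X_z \cap X_u$. Proceeding as in eq. (4), we find that

$$\ell_j(y_1, \ldots, y_{j-1}, f_{k-j}(x)) = y_j = \ell_j^\dagger(y_1, \ldots, y_{j-1}, f_{k-j}^\dagger(x)).$$

Therefore there exists a linear form $h(z_1, \ldots, z_{j-1})$ and scalars $a, a^\dagger \in \mathbb{F}_q$, where either $h \ne 0$ or
$a \ne a^\dagger$ or both, such that $h(y_1, \ldots, y_{j-1}) + a f_{k-j}(x) - a^\dagger f_{k-j}^\dagger(x) = 0$. By Observation 3.6,

$$h(\lambda_1(f_{k-1}(x)), \ldots, \lambda_{j-1}(f_{k-1}(x), \ldots, f_{k-j+1}(x))) + a f_{k-j}(x) - a^\dagger f_{k-j}^\dagger(x) = 0. \tag{7}$$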

We claim that the left-hand side of eq. (7) is a nonzero polynomial. If $h = 0$, this is immediate
because $a \ne a^\dagger$, whereas $f_{k-j}(x)$ and $f_{k-j}^\dagger(x)$ are both monic of degree $d + k - j$. If $h \ne 0$, then by
our observations about the polynomials $\{\lambda_i(f_{k-1}(x), \ldots, f_{k-i}(x))\}$, the first term on the left-hand
side is a nonzero polynomial in the span of $\{f_{k-1}(x), \ldots, f_{k-j+1}(x)\}$. In particular, its degree is at
least $d + k - j + 1$. The other two terms have degree at most $d + k - j$, which proves the claim.

Thus, eq. (7) states that $x$ is a root of a nonzero polynomial of degree at most $d + k - 1$, a fact
we derived from the condition that $\mathbf{x} \in X_z \cap X_u$. By Observation 3.7, $|X_z \cap X_u| \le d + k - 1$. □
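
The construction can be checked by brute force for very small parameters. The sketch below works over $\mathbb{Z}_q$ with $q$ prime (an assumption that avoids implementing general finite fields), represents an edificial equation as a tuple `(rank, linear coefficients, non-leading coefficients of f)`, and names each vertex by the sequence of equations on its path from the root; these representations and the function names are illustrative choices, not part of the paper.

```python
from itertools import product


def edificial_equations(q, k, d, i):
    """All rank-i equations y_i = l_i(y_1, ..., y_{i-1}, f_{k-i}(x)) over Z_q (q prime)."""
    eqs = []
    for lin in product(range(q), repeat=i):
        if lin[-1] == 0:                                  # the z_i-coefficient must be nonzero
            continue
        for low in product(range(q), repeat=d + k - i):   # non-leading coefficients of monic f
            eqs.append((i, lin, low))
    return eqs


def satisfies(q, point, eq):
    i, lin, low = eq
    x, ys = point[0], point[1:]
    deg = len(low)                                        # deg = d + k - i
    f = (pow(x, deg, q) + sum(c * pow(x, e, q) for e, c in enumerate(low))) % q
    rhs = (sum(lin[j] * ys[j] for j in range(i - 1)) + lin[-1] * f) % q
    return ys[i - 1] == rhs


def build_varieties(q, k, d):
    universe = list(product(range(q), repeat=k))
    varieties = {(): frozenset(universe)}                 # the root's variety is all of X
    frontier = [()]
    for rank in range(k - 1, 0, -1):                      # edges leaving level rank+1 carry rank-`rank` equations
        eqs = edificial_equations(q, k, d, rank)
        nxt = []
        for v in frontier:
            for eq in eqs:
                child = v + (eq,)
                varieties[child] = frozenset(p for p in varieties[v] if satisfies(q, p, eq))
                nxt.append(child)
        frontier = nxt
    return varieties, frontier                            # frontier now holds the leaves


q, k, d = 3, 3, 0
varieties, leaves = build_varieties(q, k, d)
t = len(edificial_equations(q, k, d, 1))
leaf_sizes = {len(varieties[z]) for z in leaves}
worst = max(
    len(varieties[z] & varieties[v])
    for z in leaves for v in varieties
    if v != z[:len(v)]                                    # skip z itself and its ancestors
)
```

For these parameters the universe has only 27 points, so the check is exhaustive; it is meant as a sanity test of (E5) and (E6), not as a substitute for the proof.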

Justifications for observations. For the sake of completeness, we formally justify the observations made in the course of the just-concluded proof. Observation 3.6 can be proved by induction on $i$. When
$i = 1$, eq. (2) specialises to $y_1 = \ell_1(f_{k-1}(x))$, so we reach eq. (3) by taking $\lambda_1 = \ell_1$. For general $i$,
by the induction hypothesis, we have

$$y_i = \ell_i(\lambda_1(f_{k-1}(x)), \ldots, \lambda_{i-1}(f_{k-1}(x), \ldots, f_{k-i+1}(x)), f_{k-i}(x)). \tag{8}$$

Each argument to $\ell_i$ in the above equation is a linear form in $\{f_{k-1}(x), \ldots, f_{k-i}(x)\}$, and $\ell_i$ is itself
a linear form. Taking $\lambda_i$ to be the "composition" of these linear forms gives us eq. (3).

Observation 3.7 is, as noted, a simple corollary to Observation 3.6.

We turn to the observation, made just after (6), that $\lambda_i(z_1, \ldots, z_i)$ has a nonzero $z_i$-coefficient.
Of the $i$ arguments to $\ell_i$, only the last involves $f_{k-i}(x)$, and that last argument is given a nonzero
coefficient by the defining property of $\ell_i$. The other arguments are polynomials in the span of
$\{f_{k-1}(x), \ldots, f_{k-i+1}(x)\}$. The linear independence observed in (6) completes the justification.

3.5 Pass/Approximation Tradeoff for Set Cover 

We now bring together our technical results to obtain a pass/approximation tradeoff for SET- 
COVER in the semi-streaming setting. 

Theorem 3.8 (Main result). Let $c > 1$ be a constant. Let $A$ be a $p$-pass streaming algorithm that, for
all large enough $n$ and $m$, approximates the optimum value of SET-COVER$_{n,m}$ instances to a factor smaller
than $n^{1/(p+1)}/(c(p + 1)^2)$ with probability at least 2/3. Then $A$ must use $\Omega(n^c/p^3)$ bits of space. This
space lower bound applies to instances with $m = \Theta(n^{cp})$.
Proof. Let $q$ be a sufficiently large prime power. Put $d = (c - 1)(p + 1)$, $t = q^{d+p+1}(1 - 1/q)$, and
$n = q^{p+1}$. By Theorem 3.5, there exists a $(p + 1, d, q, t)$-edifice over a universe $X$ with $|X| = n$.
By Theorem 3.4, the space usage of $A$, which approximates SET-COVER to a factor better than
$n^{1/(p+1)}/((p + d)(p + 1))$, is at least $R^p(\mathrm{MPJ}_T)/p$, where $T$ is a complete $(p + 1)$-level $t$-ary tree.
By Theorem 3.2, this space bound is $\Omega(t/p^3) = \Omega(q^{c(p+1)}(1 - 1/q)/p^3) = \Omega(n^c/p^3)$.

Examining the reduction in Theorem 3.4 shows that instances of SET-COVER demonstrating the
above lower bound have roughly as many sets as the edifice has leaves, i.e., $m = \Theta(n^{cp})$. □
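
The parameter bookkeeping in this proof can be replayed with exact arithmetic. The sketch below assumes an integer value of $c$ purely so that every quantity is an integer or an exact fraction; the function name `parameters` is hypothetical.

```python
from fractions import Fraction


def parameters(c, p, q):
    """Edifice parameters used in the proof, for integer c (an illustrative restriction)."""
    d = (c - 1) * (p + 1)
    n = q ** (p + 1)                                 # size of the universe
    t = q ** (d + p + 1) - q ** (d + p)              # q^{d+p+1} (1 - 1/q), as an exact integer
    gap_edifice = Fraction(q, (p + d) * (p + 1))     # threshold of Theorem 3.4
    gap_main = Fraction(q, c * (p + 1) ** 2)         # n^{1/(p+1)} / (c (p+1)^2), using n^{1/(p+1)} = q
    return d, n, t, gap_edifice, gap_main


d, n, t, gap_edifice, gap_main = parameters(c=2, p=2, q=7)
```

The two facts the proof relies on are visible in the output: $t = n^c(1 - 1/q)$, and the factor in the main theorem never exceeds the factor in Theorem 3.4 because $p + d \le c(p + 1)$.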

It is instructive to note the following corollaries of Theorem 3.8. 

1. Let $p$ be a constant. Then there exist positive constants $\alpha < 1$ and $\beta > 1$ such that $(\alpha n^{1/(p+1)})$-approximating SET-COVER in $p$ streaming passes requires $\Omega(n^{\beta})$ space. In particular, such an approximation is not possible for a semi-streaming algorithm.

2. Every multi-pass semi-streaming $O(\log n)$-approximation algorithm for SET-COVER requires
$p = \Omega(\log n / \log\log n)$ passes.

3.6 Two-Player Communication Complexity of Set Cover 

Nisan [27] and Demaine et al. [10] have studied SET-COVER as a communication game. Our proof 
of Theorem 3.8 directly implies a lower bound for a certain multi-player SET-COVER game. But one
may wonder about implications for the more fundamental setting of two-player communication 
complexity. Our next theorem shows that our technology does indeed yield a new two-player 
result. 

In the two-player SET-COVER game, there is a fixed finite universe $X = [n]$, Alice receives as
input a collection $\mathcal{F} \subseteq 2^X$, and Bob receives a collection $\mathcal{G} \subseteq 2^X$. The players wish to solve the SET-COVER instance $(X, \mathcal{F} \cup \mathcal{G})$ as cheaply as possible. Specifically, they must output a cover certificate
(analogous to the array Coverer in Algorithm 1) that specifies, for each $x \in X$, the set in $\mathcal{F} \cup \mathcal{G}$ that
covers $x$. A communication protocol that gives such an output is said to be $\alpha$-approximate if the
implied set cover, Sol, satisfies $|\mathit{Sol}| \le \alpha|\mathit{Opt}|$, where Opt is an optimum solution to the instance.

By mimicking the standard offline greedy algorithm for SET-COVER, one readily obtains a
$(\ln n - \ln\ln n + O(1))$-approximate protocol that communicates at most $n$ messages, each message being $n$ bits long; in particular, the total communication cost is $O(n^2)$. Nisan proved [27, Theorem 4] that for every constant $\delta > 0$, a $(\tfrac{1}{2} - \delta)\log_2 n$-approximate protocol requires an amount of
communication that is exponentially larger, roughly $\exp(\sqrt{n})$ for small $\delta$. Nisan's theorem uses a
reduction from SET-DISJOINTNESS and is therefore agnostic about the number of messages in the
protocol. Our theorem complements this by giving a "bounded-round" lower bound.

Theorem 3.9. Let $c > 1$ be a constant. Suppose there exists a (randomised) $\alpha$-approximate protocol for
the two-player SET-COVER game that communicates a total of C bits in at most r messages. Then either 
a ^ n^/(''+^V(c(r -|-1)^) or C = 0.(rf /r"^). 

Proof sketch. We encode an instance of POINTER-JUMPING on a tree as a SET-COVER instance, using 
our edifices, exactly as in the proof of Theorem 3.4. We then treat pointer-jumping as a two- 
player communication game, with Alice holding the information at vertices of the tree whose 
level is odd, and Bob holding the rest. For this two-player game, we invoke the bounded-round
communication lower bound due to Klauck et al. [19] to finish the proof. □ 

While we could have used the above two-player version of pointer-jumping as the basis 
for a data-streaming lower bound, it is important to note that doing so would have considerably 
weakened the streaming result, because p streaming passes translate into 2p — 1 messages in a 
two-player protocol. 
4 Extension to Partial Cover


Thus far we have focused on the SET-COVER problem as traditionally defined, in which a feasible 
solution must cover the entire universe. However, as is the case with many optimisation problems, 
SET-COVER admits a relaxation in the form of a bicriterial approximation, wherein the feasibility 
constraint can be violated by some amount $\varepsilon$, and we seek a solution with cost at most $\alpha(\varepsilon, n)$
times the optimum fully feasible solution, for some function $\alpha$.

To be precise, we consider the problem PARTIAL-COVER$_{n,m,\varepsilon}$, where an instance $\mathcal{I} = (X, \mathcal{F}, \varepsilon)$
consists of a universe $X$, with $|X| = n$, a collection of sets $\mathcal{F} \subseteq 2^X$ with $|\mathcal{F}| = m$, and a parameter
$\varepsilon \in [0, 1]$. The goal is to compute a $(1 - \varepsilon)$-partial cover of $X$, defined as a collection $\mathit{Sol} \subseteq \mathcal{F}$
that covers at least $(1 - \varepsilon)|X|$ elements. Such a solution Sol is said to be $\alpha$-approximate if $|\mathit{Sol}| \le \alpha|\mathit{Opt}|$ (or, in the weighted version, $w(\mathit{Sol}) \le \alpha\, w(\mathit{Opt})$), where Opt is a minimum-cost set cover
for $(X, \mathcal{F})$. Notice that we are comparing the cost of our partial cover with that of the best total
cover.

4.1 Upper Bound 

We begin with our most general upper bound, which includes partial covers and weighted sets. 
For convenience in stating the space bound, we assume that all weights are 0(log ?M)-bit integers. 

Theorem 4.1. For every integer $p \ge 1$, there is a $p$-pass, $O(n \log m)$-space algorithm for the weighted
version of PARTIAL-COVER$_{n,m,\varepsilon}$ that produces an $\alpha(n, \varepsilon)$-approximate cost $(1 - \varepsilon)$-partial cover, where
$\alpha(n, \varepsilon) = \min\{8p\,\varepsilon^{-1/p},\ (8p + 1)\,n^{1/(p+1)}\}$.

Proof. We run the following two schemes in parallel, returning the lower-cost solution. First, we
run the Emek-Rosen algorithm for $p$ passes, each time obtaining a $(1 - \varepsilon^{1/p})$-partial cover of
the remaining (uncovered) portion of $X$, and each time adding at most $8\varepsilon^{-1/p}\, w(\mathit{Opt})$ cost to our
solution Sol. By definition of a partial cover, for each $j \in [p]$, the collection of sets constituting Sol
after $j$ passes leaves at most $\varepsilon^{j/p} n$ elements uncovered. Therefore, in the end, Sol is a $(1 - \varepsilon)$-partial cover.

Second, we run the Emek-Rosen algorithm for $p$ passes (again) but here, in each pass, obtaining a $(1 - 1/n^{1/(p+1)})$-partial cover of the remaining (uncovered) portion of $X$, and each
time adding at most $8n^{1/(p+1)}\, w(\mathit{Opt})$ cost to our solution Sol. The collection of sets constituting Sol after $j$ passes leaves at most $n^{1 - j/(p+1)}$ elements uncovered. After $p$ passes, Sol covers all but at most $n^{1/(p+1)}$ elements. Covering each of these with its cheapest-covering set
(which the Emek-Rosen algorithm records), of cost at most $w(\mathit{Opt})$, leads to a total cost of at most
$(8p + 1)\,n^{1/(p+1)}\, w(\mathit{Opt})$. Since this is a full cover of $X$, it is also a $(1 - \varepsilon)$-partial cover. □
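
The arithmetic behind the two schemes is easy to tabulate. The following sketch only evaluates the bound $\alpha(n, \varepsilon)$ and the per-pass uncovered counts of the first scheme; it does not implement the Emek-Rosen algorithm, and the function names are illustrative.

```python
def alpha(n, eps, p):
    """Approximation factor of Theorem 4.1: the better of the two schemes."""
    return min(8 * p * eps ** (-1.0 / p), (8 * p + 1) * n ** (1.0 / (p + 1)))


def uncovered_after_each_pass(n, eps, p):
    """Scheme 1: each pass leaves an eps^{1/p} fraction of what was uncovered before it."""
    return [n * eps ** (j / p) for j in range(1, p + 1)]


remaining = uncovered_after_each_pass(10 ** 6, 0.01, 2)   # roughly [100000, 10000]
factor = alpha(10 ** 6, 0.01, 2)                          # roughly 160, from the first scheme
```

For these values the $\varepsilon$-dependent term wins; once $\varepsilon$ drops below about $n^{-p/(p+1)}$ the second term takes over.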

The above upper bound generalises Theorem 2.5, except for the constant "8" that arises in
the Emek-Rosen analysis. One can tweak their algorithm so as to replace the 8 with $(1 + \delta)^3$,
where $\delta > 0$ is a constant of our choice, at the cost of increasing the space usage by a factor of
$O(1/\log(1 + \delta))$, which is about $O(\delta^{-1})$ for small $\delta$.

4.2 Lower Bound 

We shall now show that Theorem 4.1 is asymptotically tight for every constant $p$ by proving an appropriate lower bound on the approximation factor that a semi-streaming algorithm for PARTIAL-COVER can achieve. Our lower bound will hold even for unweighted PARTIAL-COVER and will
match the upper bound of Theorem 4.1 up to an $O(p^3)$ factor.
Our proof is based on edifices, as in the proof of Theorem 3.8, except that we need a different,
more complicated, setting of parameters that is not directly achieved by Theorem 3.5. Instead, 
we revisit the edifices constructed in the proof of Theorem 3.5 and observe that they have an 
additional geometric property that we call wideness: roughly speaking, each level contains many 
groups of mutually parallel varieties. Clustering these parallel classes into "supervarieties" gives 
us new edifices with the desired parameters. 

Let $C_T(u)$ denote the set of children of a vertex $u$ in a tree $T$. A $(k, d, q, t)$-edifice $T$ is said to be
$(b, t')$-wide if, for each non-leaf vertex $u$ of $T$, there exist subsets $V_1, \ldots, V_{t'} \subseteq C_T(u)$ such that

(W1) $V_1, \ldots, V_{t'}$ are pairwise disjoint;

(W2) for all $i \in [t']$, $|V_i| = b$; and

(W3) for all $i \in [t']$, for all $v \ne v' \in V_i$, we have $X_v \cap X_{v'} = \emptyset$.

Lemma 4.2. If there exists a $(k,d,q,t)$-edifice $T$ on universe $X$ that is $(b,t')$-wide, then there exists a
$(k, b^k(d + k - 1), b^{k-1}q, t')$-edifice on the same universe $X$.

Proof. The desired edifice is built by "merging" certain carefully chosen sets of vertices of T. 

Define the following colour-and-trim procedure on a vertex $u$ of $T$. If $u$ is a leaf, then do
nothing. Otherwise, let $V_1, \ldots, V_{t'}$ be subsets of $C_T(u)$ satisfying (W1)-(W3). For each $i \in [t']$, for
each $v \in V_i$, assign colour $i$ to the edge from $u$ to $v$. Delete all uncoloured edges out of $u$ as well as
the subtrees pointed to by these edges. Then recursively colour-and-trim the remaining vertices
in $C_T(u)$.

Let $T'$ be the fully edge-coloured $(bt')$-ary tree obtained by applying this colour-and-trim procedure to $r$, the root of $T$. Reusing the varieties from $T$ makes $T'$ a $(k, d, q, bt')$-edifice.

For each vertex $v$ of $T'$, define the rainbow at $v$ to be the sequence of colours on the unique path
from $r$ to $v$. Create a new edge-coloured rooted tree $T''$ by merging vertices of $T'$ that have the
same rainbow into "supervertices" and defining the parent of a supervertex $v''$ to be the vertex $u''$
whose rainbow is obtained by deleting the last colour in the rainbow at $v''$; assign this deleted
colour to the edge from $u''$ to $v''$. Property (W2) implies that each $V_i$ is nonempty; property (W1)
then implies that $T''$ is a $t'$-ary tree with $k$ levels.

For each vertex $u''$ of $T''$, let $\langle u'' \rangle$ denote the set of vertices of $T'$ that were merged to produce $u''$. Define the variety $X_{u''} \subseteq X$ thus:

$$X_{u''} = \biguplus_{u \in \langle u'' \rangle} X_u.$$

By (W3), the above union is indeed a disjoint union (denoted by "$\uplus$").

We shall show that $T''$ is the desired edifice. Properties (E1), (E2), and (E3) are immediate.
Property (E4) follows from the same property of $T'$ and the observation that whenever two vertices of $T'$ are merged in $T''$, so are their parents. For property (E5), first note that (W2) implies
that at each level $j \in [k]$ there are exactly $b^{k-j}$ vertices of $T'$ that have a particular rainbow. Thus,
for each leaf $z''$ of $T''$ we have $|\langle z'' \rangle| = b^{k-1}$. Using property (E5) of $T'$, we have

$$|X_{z''}| = \Bigl|\biguplus_{z \in \langle z'' \rangle} X_z\Bigr| = \sum_{z \in \langle z'' \rangle} |X_z| \ge \sum_{z \in \langle z'' \rangle} q = b^{k-1} q.$$


Finally, we address property (E6). As in the proof of Theorem 3.5, it suffices to upper-bound
$|X_{z''} \cap X_{u''}|$, where $z''$ is a leaf of $T''$ and $u''$ is a vertex of $T''$ that is not an ancestor of $z''$, whereas
the parent $y''$ of $u''$ is. Suppose that $u''$ is at level $j < k$. Then

$$X_{z''} \cap X_{u''} = \Bigl(\biguplus_{z \in \langle z'' \rangle} X_z\Bigr) \cap \Bigl(\biguplus_{u \in \langle u'' \rangle} X_u\Bigr) = \bigcup_{z \in \langle z'' \rangle,\ u \in \langle u'' \rangle} (X_z \cap X_u). \tag{9}$$

Since $|\langle z'' \rangle| = b^{k-1}$ and $|\langle u'' \rangle| = b^{k-j}$, this latter expression immediately leads to $|X_{z''} \cap X_{u''}| \le b^{2k-j-1}(d + k - 1)$ using property (E6) of $T'$. However, this upper bound is too weak; to strengthen
it, we consider the structure of $X_{z''}$ and $X_{u''}$ more carefully.

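Consider a generic $z \in \langle z'' \rangle$ and a generic $u \in \langle u'' \rangle$. There must exist $y_1, y_2 \in \langle y'' \rangle$ such that $z$
is a descendant of $y_1$ and $u$ is a descendant of $y_2$. The crucial observation is that if $y_1 \ne y_2$, then
by (W3), $X_{y_1} \cap X_{y_2} = \emptyset$, whence by (E4), $X_z \cap X_u = \emptyset$. Therefore the pair $(z, u)$ contributes to the
latter union in eq. (9) only when $y_1 = y_2$. Therefore,

$$X_{z''} \cap X_{u''} = \bigcup_{y \in \langle y'' \rangle}\ \bigcup_{\substack{z \in \langle z'' \rangle,\ u \in \langle u'' \rangle \\ z, u \text{ descendants of } y}} (X_z \cap X_u).$$

Since $|\langle y'' \rangle| = b^{k-j-1}$ and each $y \in \langle y'' \rangle$ has $b^j$ descendants in $\langle z'' \rangle$ and $b$ descendants in $\langle u'' \rangle$, we
obtain $|X_{z''} \cap X_{u''}| \le b^{k-j-1} \cdot b^j \cdot b\,(d + k - 1) = b^k(d + k - 1) \le b^k(d + k - 1) + k - 1$, as required. □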

Lemma 4.3. The $(k, d, q, t)$-edifice constructed in Theorem 3.5 is $(\lceil \delta q \rceil, \lfloor 1/\delta \rfloor\, t/q)$-wide for all $\delta \in (0, 1]$.

Proof. It suffices to prove the lemma in the case $\delta = 1$; a little thought shows that the general case
then follows as a corollary.

Let $T$ be the edifice constructed in Theorem 3.5. Let $u$ be a non-leaf vertex of $T$, at level $j + 1$,
where $j \in [k - 1]$. Then the edges out of $u$ are labelled by the $t$ distinct rank-$j$ edificial equations.
Let us call two such equations $[\![\ell_j : f_{k-j}]\!]$ and $[\![\ell_j^\dagger : f_{k-j}^\dagger]\!]$ similar if $\ell_j = \ell_j^\dagger$ and $f_{k-j} - f_{k-j}^\dagger$ is a
constant polynomial. This similarity relation then naturally extends to $C_T(u)$. Similarity is easily
seen to be an equivalence relation, each of whose equivalence classes has size exactly $|\mathbb{F}_q| = q$.
Therefore there are exactly $t/q$ equivalence classes; let $V_1, \ldots, V_{t/q} \subseteq C_T(u)$ be these classes.

To show that $T$ is $(q, t/q)$-wide, we shall show that these classes $\{V_i\}$ satisfy properties (W1),
(W2), and (W3). The first two properties are immediate. For the third, consider arbitrary $v \ne v' \in V_i$, for some $i$. Then $v$ and $v'$ are similar, which means that a point $\mathbf{x} = (x, y_1, \ldots, y_{k-1}) \in X_v \cap X_{v'}$
must satisfy a pair of similar, but distinct, edificial equations. Let these equations be $[\![\ell_j : f_{k-j}]\!]$ and
$[\![\ell_j : f_{k-j}^\dagger]\!]$. Consulting eq. (2), we find that

$$0 = y_j - y_j = \ell_j(y_1, \ldots, y_{j-1}, f_{k-j}(x)) - \ell_j(y_1, \ldots, y_{j-1}, f_{k-j}^\dagger(x)) = \ell_j(0, \ldots, 0, f_{k-j}(x) - f_{k-j}^\dagger(x)).$$

By definition, the linear form $\ell_j(z_1, \ldots, z_j)$ has a nonzero $z_j$-coefficient, implying that $f_{k-j}(x) - f_{k-j}^\dagger(x) = 0$. This is a contradiction, because $f_{k-j} - f_{k-j}^\dagger$ is a nonzero constant polynomial. Therefore
such a point $\mathbf{x}$ does not exist, i.e., $X_v \cap X_{v'} = \emptyset$. □
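
The similarity classes can be inspected directly in the smallest interesting case, $k = 2$ and $d = 1$, where every leaf variety is a curve $y = a\,(x^2 + c_1 x + c_0)$. The sketch below again assumes a prime $q$ so that $\mathbb{Z}_q$ is a field; the function name `leaf_variety` is an illustrative choice.

```python
from itertools import combinations, product


def leaf_variety(q, a, c1, c0):
    """Points (x, y) of Z_q^2 with y = a * (x^2 + c1*x + c0); this is the case k = 2, d = 1."""
    return frozenset((x, (a * (x * x + c1 * x + c0)) % q) for x in range(q))


q = 5
classes = {}
for a, c1, c0 in product(range(1, q), range(q), range(q)):
    # Two equations are similar when they share a and c1 and differ only in the constant term c0.
    classes.setdefault((a, c1), []).append(leaf_variety(q, a, c1, c0))

num_classes = len(classes)                                    # should be t / q
class_sizes = {len(members) for members in classes.values()}  # (W2) with b = q
all_disjoint = all(not (x1 & x2)
                   for members in classes.values()
                   for x1, x2 in combinations(members, 2))    # (W3)
```

Here $t = q^{d+k}(1 - 1/q) = 100$, so the 20 classes of size 5 account for every child of the root.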


Theorem 4.4. Let $c > 1$ be a constant. Let $A$ be a $p$-pass streaming algorithm with the following guarantee.
For all large enough $n$ and $m$ and all $\varepsilon \in (0, \tfrac{1}{2}]$, on instances of PARTIAL-COVER$_{n,m,\varepsilon}$, with probability
at least 2/3, $A$ returns the value of some $\alpha$-approximate solution to the instance, where

$$\alpha < \frac{\min\{n^{1/(p+1)},\ \varepsilon^{-1/p}\}}{8c(p + 1)^2}. \tag{10}$$

Then $A$ must use $\Omega(n^c/p^3)$ bits of space. In particular $A$ cannot be semi-streaming.
Proof. This theorem is analogous to a combination of Theorems 3.4 and 3.8; the proof is along very
similar lines.

We may as well assume that $\varepsilon \ge n^{-p/(p+1)}$, because if $\varepsilon$ is too small for this to hold, then we
simply consider the weaker problem of $(1 - \varepsilon')$-partial covering, where $\varepsilon' = n^{-p/(p+1)}$.

Pick a sufficiently large prime power $q$. Put $n = q^{p+1}$, $d = (c - 1)(p + 1) + 1$, $\delta = (2\varepsilon)^{1/p}$, and
$\hat{\delta} = \lceil \delta q \rceil / q$. By our assumption, we have $\delta \ge 1/q$ and $\delta \le \hat{\delta} \le 2\delta$.

Combining Theorem 3.5 with Lemmas 4.2 and 4.3 and working through some algebra, we find
that there exists a $(p + 1, (\hat{\delta} q)^{p+1}(d + p), (\hat{\delta} q)^p q, \lfloor 1/\hat{\delta} \rfloor q^{d+p}(1 - 1/q))$-edifice $T$ over a universe $X$
with $|X| = n$. Using the varieties of $T$, we encode each instance $\pi$ of $\mathrm{MPJ}_T$ as a collection $\mathcal{I}(\pi)$ of
subsets of $X$ exactly as in Theorem 3.4 and treat $\mathcal{I}(\pi)$ as an instance of PARTIAL-COVER$_{n,m,\varepsilon}$. As
before, if $\mathrm{MPJ}_T(\pi) = 1$, then $\mathcal{I}(\pi)$ admits a total cover using $Q_1 := p + 1$ sets.

For the case $\mathrm{MPJ}_T(\pi) = 0$, we refine the argument used for Theorem 3.4 as follows. Let $X_1$ be
the variety of $T$ at the unique leaf, $v_1$, in $T|_\pi$. As before, the elements of $X_1$ cannot be covered
by sets corresponding to ancestors of $v_1$, and each of the remaining sets in $\mathcal{I}(\pi)$ can cover at
most $(\hat{\delta} q)^{p+1}(d + p)$ such elements. Every $(1 - \varepsilon)$-partial cover must, in particular, cover at least
$|X_1| - \varepsilon|X|$ elements of $X_1$. It follows that the cheapest such partial cover uses at least $Q_0 := (|X_1| - \varepsilon|X|)/((\hat{\delta} q)^{p+1}(d + p))$ sets. Now,

$$\frac{Q_0}{Q_1} = \frac{|X_1| - \varepsilon|X|}{(\hat{\delta} q)^{p+1}(p + d)(p + 1)} \ge \frac{(\hat{\delta} q)^p q - \varepsilon q^{p+1}}{(\hat{\delta} q)^{p+1}(p + d)(p + 1)} \tag{11}$$

$$\ge \frac{(\hat{\delta} q)^p q - \tfrac{1}{2}\hat{\delta}^p q^{p+1}}{(\hat{\delta} q)^{p+1} c (p + 1)^2} \tag{12}$$

$$= \frac{1}{2\hat{\delta} c (p + 1)^2} \ge \frac{1}{4\delta c (p + 1)^2} \tag{13}$$

$$= \frac{1}{4(2\varepsilon)^{1/p} \cdot c(p + 1)^2} \ge \frac{\varepsilon^{-1/p}}{8c(p + 1)^2},$$

where (11) uses the parameters of the edifice $T$, (12) uses $\delta \le \hat{\delta}$ together with $p + d = c(p + 1)$, and (13) uses $\hat{\delta} \le 2\delta$.

Therefore, eq. (10) gives $\alpha < Q_0/Q_1$. As in Theorem 3.4, with an approximation this good, $A$
can be used to determine $\mathrm{MPJ}_T(\pi)$ and must consequently use $\Omega(t/p^3)$ bits of space, where $t$ is
the arity of $T$. Since $t = \lfloor 1/\hat{\delta} \rfloor q^{d+p}(1 - 1/q) = \Omega(n^c)$, this space lower bound is $\Omega(n^c/p^3)$. □

5 Discussion 

We conclude with a more technically detailed description of selected results from previous work, 
with the goal of shedding more light on some of our own results. 

In the external-memory setting, without a streaming restriction, an eager implementation of 
the greedy algorithm involves an inverted index and a priority queue of set sizes. Unfortunately, 
this involves arbitrary (non-local) memory accesses, leading to poor performance. 

Relaxing the strict greedy requirement, Cormode, Karloff, and Wirth add a set to the solution
if its contribution is at least $1/\beta$ times the best [9]. So that all disk accesses are sequential, initially
they allocate sets to "buckets" (files) according to their size, with a bucket for each range $[\beta^i, \beta^{i+1})$,
$i = 0, \ldots, K$, where $K = \max_i \lfloor \log_\beta |S_i| \rfloor$. Starting from $j = K$ down to 0, as each set in bucket $j$ is
examined, sequentially, set $S_i$ is added to Sol only if its contribution is at least $\beta^j$; otherwise, $(i, S_i \setminus C)$ is appended to the appropriate bucket. This is essentially the same thresholding as Algorithm 1,
with the same pass/approximation tradeoff, but implemented so that the total amount of data
handled is $O(\beta/(\beta - 1))$ times the input size.
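
The bucketing scheme just described can be sketched in memory as follows. This is an illustration of the thresholding logic only, assuming Python lists stand in for the on-disk bucket files; it is not the disk-based implementation of [9], and the function names are hypothetical.

```python
def bucket_index(size, beta):
    """Largest i with beta**i <= size (size >= 1), computed without floating-point logs."""
    i = 0
    while beta ** (i + 1) <= size:
        i += 1
    return i


def bucketed_greedy(sets, beta=2):
    """Return the indices of the chosen sets; `sets` is a list of Python sets."""
    nonempty = [(i, set(s)) for i, s in enumerate(sets) if s]
    if not nonempty:
        return []
    top = max(bucket_index(len(s), beta) for _, s in nonempty)
    buckets = [[] for _ in range(top + 1)]
    for i, s in nonempty:
        buckets[bucket_index(len(s), beta)].append((i, s))
    covered, solution = set(), []
    for j in range(top, -1, -1):                 # from the largest sets down to singletons
        for i, s in buckets[j]:
            rest = s - covered                   # the set's current contribution
            if len(rest) >= beta ** j:
                solution.append(i)
                covered |= rest
            elif rest:                           # demote the shrunken set to a lower bucket
                buckets[bucket_index(len(rest), beta)].append((i, rest))
    return solution


example = [{1, 2, 3, 4}, {1, 2}, {3, 4, 5}, {5, 6}, {6}]
chosen = bucketed_greedy(example, beta=2)
```

A demoted set always lands in a strictly lower bucket, so each bucket is scanned once from front to back, which is what makes the access pattern sequential.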


Blelloch, Simhadri, and Tangwongsan solve very large set cover instances on disk and in parallel in RAM [5]. They consider situations in which there is less than one word of memory per
element. Their pre-bucketing is much like the geometric ranges of DFG, and their MaNIS scheme 
appears to be a randomised, and parallelisable, version of the pass through the sets in a bucket. 

The Emek-Rosen scheme [12] is in some sense like DFG in its having a hierarchy of thresholds 
that are powers of 2. Its purpose however, is to facilitate partial covers with (item and) set costs. 
In the unweighted setting, as each set $S$ is seen, it is deemed to cover some subset $T \subseteq S$, where
$2^i \le |T| < 2^{i+1}$, if each element in $T$ was previously covered by some subset of size $< 2^i$, or was
previously uncovered. This is somewhat like all the runs of DFG with $\beta = 2$ being folded into one.
In parallel, the scheme records the cheapest set that covers each item (amongst equal-cheapest, 
choose the first that occurs in the stream). This step is similar to the folding in Algorithm 2. 

We contrast the threshold chosen in our algorithm with that in the Emek-Rosen algorithm. In
our two-pass algorithm (folded into one), $\tau = \sqrt{n}$, leading to a $2\sqrt{n}$ approximation (in fact, $|\mathit{Sol}| \le \sqrt{n}(1 + |\mathit{Opt}|)$). Once the stream is done, the Emek-Rosen algorithm can choose a threshold $\tau = 2^i$. Items that are recorded as covered by some such $T \subseteq S$, with $|T| \ge \tau$, are certified to be
covered by $S$; those "below the threshold" are instead covered by their cheapest set. This way, at
most $O(n/\tau)$ $T$-sets are chosen and $O(\tau\, w(\mathit{Opt}))$ elements are cheapest-set covered. Since such a
cheapest set has cost at most $w(\mathit{Opt})$, by setting this threshold $\tau$ to be approximately $\varepsilon n/w(\mathit{Opt})$,
the algorithm returns an $O(w(\mathit{Opt})/\varepsilon)$ weight solution. Of course, we do not know $w(\mathit{Opt})$, but
it suffices to choose the largest $\tau$ so that at most $\varepsilon n$ elements are cheapest-set covered. When
$\varepsilon \le 1/\sqrt{n}$ however, it is better to choose $\tau$ to leave at most $\sqrt{n}$ cheapest-set covered elements,
hence $\tau = \Theta(\sqrt{n}/w(\mathit{Opt}))$.

This tradeoff allows the Emek-Rosen algorithm to account for set weights. In the unweighted
case, however, our solution has at most $\sqrt{n}(1 + |\mathit{Opt}|)$ sets, whereas the Emek-Rosen solution
has at most $\sqrt{n}(1 + 8|\mathit{Opt}|)$ sets. As mentioned in Section 4.1, the latter expression can become
arbitrarily close, i.e., $\sqrt{n}(1 + (1 + \delta)^3|\mathit{Opt}|)$, with space increasing by a factor of $O(1/\delta)$.